anhinga_anhinga ([personal profile] anhinga_anhinga) wrote 2016-12-27 01:29 am

Dataflow matrix machines as generalized recurrent neural networks

A year ago I posted about dataflow programming and linear models of computation:

http://anhinga-anhinga.livejournal.com/82757.html

It turns out that those dataflow matrix machines are a fairly powerful generalization of recurrent neural networks.

The main feature of dataflow matrix machines (DMMs) is vector neurons. While recurrent neural networks process streams of numbers, dataflow matrix machines process streams of representations of arbitrary vectors (linear streams).

Another important feature of DMMs is that neurons of arbitrary input and output arity are allowed, and a rich set of built-in transformations of linear streams is provided.

Recurrent neural networks are Turing-complete, but as a programming language they are esoteric and not a convenient general-purpose programming platform. DMMs provide a formalism friendly to handling sparse vectors, conditionals, and more, and there are indications that DMMs will grow into a powerful general-purpose programming platform, in addition to being a convenient machine learning platform.

In this context, it is possible to represent large classes of programs by matrices of real numbers, which allows us to modify programs in a continuous fashion and to synthesize programs by synthesizing matrices of real numbers.

Further details and preprints

Self-referential mechanism: Consider a linear stream of matrices describing the connectivity pattern and weights of a DMM. Select a dedicated neuron Self emitting such a stream on its output, and use the latest value of that stream as the current network matrix (the matrix describing the connectivity pattern and weights of our DMM). A typical Self neuron works as an accumulator taking additive updates from other neurons in the network. This mechanism enables reflection and powerful dynamic self-modification; in particular, the networks in question can expand themselves dynamically.
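To make this concrete, here is a minimal sketch of such an accumulator-style update in Clojure (the names and the representation of the matrix as a nested map of non-zero weights are hypothetical, for illustration only; the actual implementation linked below differs in detail):

;; Toy sketch of the Self accumulator, not the actual DMM code.
;; The network matrix is kept as a nested map of non-zero weights.
(defn add-updates
  "Deep-merge additive updates into the current network matrix:
   numbers are added, nested maps are merged recursively."
  [current updates]
  (merge-with (fn [x y]
                (if (and (map? x) (map? y))
                  (add-updates x y)
                  (+ x y)))
              current updates))

(def w0    {:neuron-a {:out {:neuron-b {:in 1.0}}}})
(def delta {:neuron-a {:out {:neuron-b {:in 0.5}}}})
(add-updates w0 delta)
;; => {:neuron-a {:out {:neuron-b {:in 1.5}}}}

On each cycle, Self emits the updated matrix, and that matrix is then used to recompute the neuron inputs on the next step.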

The recent DMM-related preprints by our group:

https://arxiv.org/abs/1603.09002

https://arxiv.org/abs/1605.05296

https://arxiv.org/abs/1606.09470

https://arxiv.org/abs/1610.00831

Modern recurrent neural networks with good machine learning properties, such as LSTM and Gated Recurrent Unit networks, are naturally understood in the DMM framework as networks having linear and bilinear neurons in addition to neurons with more traditional sigmoid activation functions.

Our new open source effort

The new open-source implementation of core DMM primitives in Clojure:

https://github.com/jsa-aerial/DMM

This open-source implementation features a new vector space of recurrent maps (space of "mixed rank tensors"), which allows us to represent a large variety of linear streams as streams of recurrent maps. The vector space of recurrent maps also makes it possible to express variadic neurons as neurons having just one argument.

Therefore a type of neuron is simply a function transforming recurrent maps, which is a great simplification compared to the formalism presented in the preprints above. See the design notes within this open-source implementation for further details.
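For illustration only, here is a tiny hedged sketch (in Clojure, with made-up names; this is not the API of the repository above) of a variadic neuron written as a function of a single recurrent map:

;; Sketch: a recurrent map is a nested Clojure map whose leaves are numbers.
;; A neuron with several named inputs receives them packed into one such map.
(defn sum-inputs
  "A 'variadic addition' neuron: adds up all numeric leaves of its single
   recurrent-map argument and returns a one-field output map."
  [input-map]
  (letfn [(total [v]
            (if (map? v)
              (reduce + (map total (vals v)))
              v))]
    {:sum (total input-map)}))

(sum-inputs {:x 1.0 :y 2.0 :more {:z 3.0}})
;; => {:sum 6.0}

The number of inputs is not fixed anywhere in the function; it is determined by the shape of the map that arrives on the stream.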

[identity profile] livejournal.livejournal.com 2016-12-27 07:33 am (UTC)(link)
Hello! Your entry got to top-25 of the most popular entries in LiveJournal!
Learn more about LiveJournal Ratings in FAQ (https://www.dreamwidth.org/support/faqbrowse?faqid=303).

[identity profile] anhinga-anhinga.livejournal.com 2016-12-27 03:33 pm (UTC)(link)
I wonder if this strange "top-25" marking is real :-) (It is supposed to count the number of visits, but I doubt it can distinguish genuine visits from bots, and there was quite a bit of bot activity.)

In any case, I'll be happy to post further details/comments on DMMs and/or recurrent neural nets, if anyone wants them.

And, on another note, I'll be happy to post Clojure notes, if anyone wants to see those. If anyone is potentially interested in learning Clojure (a JVM-based modern Lisp) or in improving their Clojure skills, those notes might come in handy...
Edited 2016-12-27 22:08 (UTC)

[identity profile] bvn-mai.livejournal.com 2016-12-28 07:03 am (UTC)(link)
I think many people just visit without commenting :) - those are not bots. And as for the details - I am interested.

[identity profile] bvn-mai.livejournal.com 2016-12-28 07:19 am (UTC)(link)
By the way, the fivefold repetition of my comment is clear evidence that these are not bots, but the clumsy hands of LiveJournal's new owners.

[identity profile] anhinga-anhinga.livejournal.com 2016-12-28 04:06 pm (UTC)(link)
Yes, it's terribly glitchy. I'll gradually write a series of comments on this post :-)

[identity profile] datjko.livejournal.com 2016-12-28 07:47 am (UTC)(link)
Hi Misha! I'm not quite a bot yet.
Please do post more details on DMMs and comments on RNNs. Thank you.

[identity profile] anhinga-anhinga.livejournal.com 2016-12-28 04:08 pm (UTC)(link)
Hi!

Cool, I'll be writing more comments in this post with the details.

[identity profile] anhinga-anhinga.livejournal.com 2016-12-28 04:42 pm (UTC)(link)
So, I'll try to write this as a series of informal comments.

First of all, one should think about recurrent neural networks (RNNs) and about DMMs as "two-stroke engines". On the "up movement", the activation functions built into the neurons are applied to the inputs of the neurons, and the outputs of the neurons are produced. On the "down movement", the neuron inputs are recomputed from the neuron outputs using the network matrix. This cycle of "up movement"/"down movement" is repeated indefinitely.

The network matrix defines the topology and the weights of the network. The columns of the network matrix are indexed by the neuron outputs, and the rows of the network matrix are indexed by the neuron inputs. If a particular weight is zero, this means that the corresponding output and input are not connected, so the sparsity structure of the matrix defines the topology of the network.
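Here is a schematic sketch of one such cycle for a plain scalar network (Clojure, hypothetical names; the real engine operates on recurrent maps rather than on numbers, but the shape of the cycle is the same):

;; Toy two-stroke cycle for a scalar network, for illustration only.
;; activations: map from neuron id to its activation function
;; weights:     map {input-neuron-id {output-neuron-id weight}}
(defn up-movement [activations inputs]
  ;; apply each neuron's activation function to its current input
  (into {} (for [[id f] activations] [id (f (get inputs id 0.0))])))

(defn down-movement [weights outputs]
  ;; recompute each neuron input as a linear combination of neuron outputs
  (into {} (for [[in-id row] weights]
             [in-id (reduce + (for [[out-id wt] row]
                                (* wt (get outputs out-id 0.0))))])))

(defn step [activations weights inputs]
  (down-movement weights (up-movement activations inputs)))

;; example: neuron :a is a ReLU, neuron :b is a tanh; :a feeds :b with
;; weight 1.0, and :b feeds back into :a with weight 0.5
(def acts {:a #(max 0.0 %) :b #(Math/tanh %)})
(def w    {:a {:b 0.5} :b {:a 1.0}})
(step acts w {:a 1.0 :b 0.0})
;; => {:a 0.0, :b 1.0}   ; these become the inputs of the next cycle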

It used to be traditional to have the same activation function for all neurons in the network, usually some kind of sigmoid, but these days it is commonplace to mix a few different kinds of activation functions within the same network. Besides sigmoid functions such as the logistic function ("soft step") and the hyperbolic tangent, a particularly popular function in recent years is ReLU (rectifier, y=max(0,x)). The table in the following Wikipedia article compares some of the activation functions people use.

https://en.wikipedia.org/wiki/Activation_function
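For concreteness, a few of these activation functions written out in Clojure (these are just the standard formulas, nothing DMM-specific):

;; Common scalar activation functions (standard formulas).
(defn logistic [x] (/ 1.0 (+ 1.0 (Math/exp (- x)))))   ; "soft step"
(defn tanh-act [x] (Math/tanh x))                      ; hyperbolic tangent
(defn relu     [x] (max 0.0 x))                        ; rectifier

(logistic 0.0)             ;; => 0.5
(map relu [-2.0 0.0 3.0])  ;; => (0.0 0.0 3.0)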
Edited 2016-12-28 17:39 (UTC)

[identity profile] anhinga-anhinga.livejournal.com 2016-12-28 07:13 pm (UTC)(link)
I am trying to keep two angles of view on this subject at the same time: neural nets as a computational platform and neural nets as a machine learning platform.

The activation functions in that Wikipedia article I link above all have one argument (and one result). What if we allow two arguments? For example, what if we allow a neuron to accumulate two linear combinations on two inputs on the "down movement", and to multiply them together during the "up movement"?

It turns out that this is very powerful. For example, if we think about one of those inputs as the "main signal", and about the other input as the "modulating signal", then what we get is a fuzzy conditional. By setting the modulating signal to zero, we can turn off parts of the network and redirect the signal flow in the network. By setting the modulating signal to one, we just let the signal through. By setting it to something between 0 and 1, we can attenuate the signal, and by setting it above 1 or below 0, we can amplify or negate the signal.
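A toy illustration of this gating by multiplication (nothing here is specific to the DMM codebase, it is just ordinary arithmetic):

;; A two-argument "multiplication neuron": one input is the main signal,
;; the other is a modulating signal acting as a fuzzy conditional.
(defn gate [main modulator] (* main modulator))

(gate 3.0 0.0)   ;; => 0.0   ; gate closed: the signal is switched off
(gate 3.0 1.0)   ;; => 3.0   ; gate open: the signal passes unchanged
(gate 3.0 0.25)  ;; => 0.75  ; attenuation
(gate 3.0 2.0)   ;; => 6.0   ; amplification
(gate 3.0 -1.0)  ;; => -3.0  ; negation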

This has been understood for decades. In particular, the earliest proof of Turing completeness of RNNs known to me, from 1987, features multiplication neurons prominently:

http://www.demo.cs.brandeis.edu/papers/neuring.pdf

But then it was mostly forgotten. In the 1990s, people managed to prove Turing completeness of RNNs without multi-argument neurons.

Of course, this does illustrate that theoretical Turing completeness and practical convenience of programming are not the same thing. (There was a recent talk by Edward Grefenstette of DeepMind, which I hope to discuss at some later point, arguing that the practical power of traditional RNNs is more like the power of finite state machines, theoretical Turing completeness notwithstanding. One of the objectives of the DMM line of research is to boost the practical convenience of recurrent neural networks as a programming platform.)

The original RNNs had mediocre machine learning properties because of the problem of vanishing gradients. The first architecture which overcame this problem was LSTM in 1997:

https://en.wikipedia.org/wiki/Long_short-term_memory

LSTM and other architectures of this family eliminate the problem of vanishing gradient by introducing "memory" and "gates" (multiplicative masks) as additional mechanisms. However, the more straightforward way to think about this is to introduce neurons with linear activation functions for memory and bilinear neurons (the multiplication neurons I am discussing here) for gates (for good references, see Appendix C of the last of the preprints of this series, https://arxiv.org/abs/1610.00831 ).
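As a hedged sketch of this reading (my own illustrative notation, not the notation of the preprints), the update of a single memory cell can be written with exactly these ingredients: sigmoid neurons computing the gates, bilinear (multiplication) neurons applying them, and a linear neuron keeping the sum:

;; c' = forget-gate * c + input-gate * candidate
(defn sigmoid [x] (/ 1.0 (+ 1.0 (Math/exp (- x)))))

(defn cell-update [c forget-pre input-pre candidate]
  (+ (* (sigmoid forget-pre) c)            ; bilinear: forget gate * old memory
     (* (sigmoid input-pre) candidate)))   ; bilinear: input gate * new candidate

(cell-update 1.0 10.0 -10.0 0.7)
;; => approximately 1.0 (forget gate near 1, input gate near 0: memory is kept)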

So, here is the story. The neurons with two arguments are back; they are necessary for modern RNN architectures such as LSTM and Gated Recurrent Units to work. But the way these modern architectures are usually presented avoids saying explicitly that we have neurons with a two-argument activation function (namely, multiplication) here; instead people talk about these things as "extra mechanisms added to RNNs", which makes them much more difficult to understand.
Edited 2016-12-28 19:14 (UTC)

[identity profile] anhinga-anhinga.livejournal.com 2016-12-28 10:21 pm (UTC)(link)
The following post "The Unreasonable Effectiveness of Recurrent Neural Networks" by Andrej Karpathy illustrates the remarkable power of LSTM as a machine learning platform:

http://karpathy.github.io/2015/05/21/rnn-effectiveness/

I was particularly impressed by his examples of the nets learning to reproduce patterns observed in real-life C code and patterns observed in mathematical texts written in LaTeX. This is the post I recommend to people as a starting point for familiarizing themselves with the impressive capabilities of RNNs.

His examples are character-based nets, with characters represented as vectors whose dimension is the size of the alphabet in question. A particular character, say 'X', would be represented on the input by a vector with value 1 at the coordinate corresponding to 'X' and zeros at the other coordinates. So the conventional scalar-based RNN he uses needs as many neurons as there are characters in the alphabet just to input the characters into the network. This is problematic if one wants to program character-based algorithms in RNNs manually, and it would be problematic for large alphabets such as Unicode, which has more than 100,000 characters.

This is the first illustration of why it might be good to have vector neurons. If one has neurons which process linear streams, one can treat characters as sparse vectors, and then one can have character-processing nets with the number of neurons independent of the size of the alphabet.
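A small illustration of the difference (the representations here are hypothetical, just to make the point about sparsity):

;; One-hot representation: the dimension equals the size of the alphabet.
;; For a 100,000+ character alphabet this vector is huge and almost all zeros.
(def one-hot-x (assoc (vec (repeat 26 0.0)) (- (int \x) (int \a)) 1.0))

;; Sparse representation: a map from the character to its coefficient;
;; its size does not depend on the size of the alphabet.
(def sparse-x {\x 1.0})

;; Linear combinations of characters stay small in the sparse representation:
(merge-with + {\x 1.0} {\y 0.5 \x 2.0})
;; => {\x 3.0, \y 0.5}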

For example, our preprint https://arxiv.org/abs/1606.09470 has a simple network with 9 neurons and 10 non-zero weights solving a standard coding interview problem with character strings (Section 1 explains the problem and Section 3, pages 8 and 9, solves it). It seems that this problem cannot be solved in a traditional RNN in such a way that the size of the RNN would be independent of the size of the alphabet (at least, there is no simple solution of this kind). This is one of the first indications that DMMs are indeed a more powerful programming platform than RNNs.

[identity profile] datjko.livejournal.com 2016-12-29 10:40 am (UTC)(link)
Misha, thank you, this looks very interesting, but I'm rather a newbie here. Could you answer a first bunch of my silly questions here or give me a link to some introduction? I started learning ML more or less seriously only recently, and some looking into the papers you gave makes it clear to me: it's yet another interesting story that I started listening to from the middle of the book.

1. why can't a neuron with a fixed number of inputs be approximated by a cascade of layers of ordinary neurons? or what are the advantages of a single multiple-input neuron compared to a subnetwork of ordinary neurons?

2. are the multiple inputs of a DMM neuron supposed to have different roles/interpretations (as in your example of "main signal" and "modulating signal", or as 'recurrent' connections in RNNs), or are they thought to be fed from homogeneous streams, like the inputs to the linear combination before the activation function in regular neurons? Or is it up to the designer of the network?

3. Do I understand it right that generally in DMMs the number of a neuron inputs is not necessarily fixed?

4. and even new neurons may be created dynamically? In training time only or even in a test stream processing?

5. "friendly to handling sparse vectors" is friendly only to "one-of-all" representation as for letters in an alphabet or there is some more generic friendship with sparsity?

I'm sorry, I can only guess how silly can be my understanding.

[identity profile] anhinga-anhinga.livejournal.com 2016-12-30 04:22 am (UTC)(link)
Hi! Let me start by answering your questions, then point to some links.

1. Indeed, there is a theorem that any reasonable function can be approximated by a neural cascade. But pragmatically speaking, it's a nightmare, not something easy to use in practice. (E.g. try to craft a network performing multiplication of two reals. Also, in general, the more precise the approximation, the larger the cascade/subnetwork must be.) So it is convenient to allow adding built-in activation functions as needed.

2. It's up to the designer. E.g. even for multiplication, sometimes an asymmetric interpretation is convenient, and sometimes it might be convenient to think in a symmetric fashion that both inputs modulate each other.

3. Typically, we think about a particular neuron as having a type. And this type says among other things how many inputs a neuron has, and what vectors are associated with them. For example, one input might take a vector representing a letter (so it would have a dimension of the alphabet in question), and another input might take a number (a one-dimensional vector), and so on. And the network can have neurons of different types.

So in that approach, the number of inputs for a single given neuron is fixed, but there can be neurons of varying type. This is our approach in the preprints.

Our latest version is more flexible (the one in our new open-source project in Clojure). There we use the vector space of recurrent maps to implement variadic neurons. Basically, recurrent maps allow us to assemble any number of arguments into one single argument, and so we can code neurons with an arbitrary (and not necessarily fixed) number of inputs via functions of one argument.

4. Yes, we think about countable-sized networks, but with only a finite number of non-zero elements in the network matrix at any given time. Only the neurons with a non-zero weight associated with them are considered active. So at any given moment only a finite subnetwork is active, and the rest is silent. New neurons can be recruited from the silent part by making some of their associated weights non-zero. So as long as the network weights are allowed to change, new neurons can be created dynamically (by recruiting them from the countable silent part).

5. There is actually friendship with any reasonable representation of vectors in computers. Even with very approximate representations (e.g. probability distributions, which are often infinite-dimensional as vectors, can be approximately represented by samples, and then variants of a stochastic sum can be used to implement linear combinations).
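For example, here is one minimal sketch of such a stochastic sum, under the simplifying assumption of a convex combination (weight a between 0 and 1) of two sample-represented distributions; this is just ordinary mixture sampling, and not necessarily the specific variant one would use in a DMM:

;; Approximate a*P + (1-a)*Q for distributions represented by samples,
;; assuming 0 <= a <= 1, by drawing each new sample from P with
;; probability a and from Q otherwise (i.e. sampling from the mixture).
(defn stochastic-sum [a samples-p samples-q n]
  (repeatedly n #(if (< (rand) a)
                   (rand-nth samples-p)
                   (rand-nth samples-q))))

;; example: a 30/70 mixture of two crudely sampled distributions
(frequencies (stochastic-sum 0.3 [:p] [:q] 1000))
;; => roughly {:p 300, :q 700}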

I think this is a somewhat difficult material. (I am telling various parts of this story to people, and it's not always easy for them to understand.)

It's not too difficult compared to many math texts, but as computer science it's not very simple.

I am happy to answer more questions. I can also put together more references to RNNs (let me know, if I should).

But speaking about DMMs, the links above are a complete set of the literature available at the moment. We discovered DMMs last summer, and this year we understood that what we have is a generalized version of RNNs; these preprints and the notes in the open-source projects (this one, and the previous one mentioned in the previous livejournal post) are all that exists at the moment on DMMs.

RNN-related literature is quite extensive. The link to Karpathy's post in my comments above is very useful. I found various Wikipedia pages helpful. On LSTMs see Appendix C of the last of our preprints and this awesome paper it talks about: https://www.arxiv.org/abs/1603.09420 (this is the best way to understand LSTMs and similar architectures).
Edited 2016-12-30 04:27 (UTC)

[identity profile] datjko.livejournal.com 2016-12-30 05:56 am (UTC)(link)
Hi!
Wow! So generic and still workable!
I'll try to read through and return with new questions. How nice to be able to get answers firsthand.
Maybe you could also write a post resolving some typical newbie questions and difficulties in understanding? Say, assuming a decent undergraduate background in CS/ML/FP on the part of the reader? I believe many people would greatly appreciate it.
Thank you, Misha!

[identity profile] anhinga-anhinga.livejournal.com 2016-12-30 05:49 pm (UTC)(link)
Hi!

Yes, I think it is a very interesting class of models and a very interesting computational platform. All standard methods of machine learning should be applicable to it.

My real hope is, of course, that it will allow us to create better methods of machine learning in the coming year. So I invite people to try to make progress in that direction, both on their own and in collaboration with me :-)


***

Speaking of questions and difficulties, this conversation is a good start. Other people might read it and find it useful. So it would be great if you have more questions!

[identity profile] datjko.livejournal.com 2016-12-30 08:57 pm (UTC)(link)
Misha, thank you, I will gladly try to read into your group's papers, and maybe I'll try to port something to tensorflow next month. But I'm only an amateur, and my questions may actually lead away from the important essence rather than help good people see it. It would surely be more useful if you yourself (if I'm not mistaken, back when I was a schoolboy and came to CEMI to tinker with various little things, we were on a first-name basis, right?) came up with and selected the important questions and answered them in some text on GitHub, where people would find it next to working code. And I'll still ask plenty of unnecessary questions :-)
Hanukkah sameach and a good 2017!

[identity profile] anhinga-anhinga.livejournal.com 2016-12-31 02:51 am (UTC)(link)
I've also started looking a bit at tensorflow, to see how easy or hard it would be to port something from DMMs there.

How much have you already tinkered with tensorflow? (One of my questions is what the situation with sparse tensors is there; I know they exist, but how easy is it to use them, for example in the existing implementations of gradient steps?)

Yes, of course, we were on a first-name basis :-)

***

Well, for now I've added a comment here with the list of Clojure links that came in handy for me this year, for those who might want to learn Clojure or to read code written in it... But having versions in more widely accessible languages as well would certainly be the right thing, as would having better tutorials/FAQs...

Hanukkah sameach and Happy New Year!

[identity profile] datjko.livejournal.com 2017-01-01 10:31 am (UTC)(link)
Misha, alas, I'm still a complete newbie and have only managed to play a little with tensorflow (and haven't played at all with the rest - caffe/torch/theano). I didn't even expect tensorflow to already have sparse tensors; I was thinking about how to get by without them, and you opened my eyes. Well, now I just need to gradually funnel together everything I've learned this month - I hadn't gotten around to it before.

Thanks for the Clojure links too! (blushing) I don't know it either, but I was planning to learn it once it landed on my head. It looks like it already has.

- and to have tutorials/FAQs
- Yes! I'll try to read carefully and figure out what to ask and how to play with it, so that it becomes clearer.
I'll keep you posted :-)

[identity profile] datjko.livejournal.com 2017-01-02 10:39 am (UTC)(link)
I'm trying to understand the basics.
1. what is "generalized animation"? is it something from earlier works? do I need to know it?

2. "Recurrent maps are maps which map arbitrary legal hash-map keys into numbers or other recurrent maps." - ок. Как вы из него делаете векторное пространство? Можешь в двух словах пересказать что написано в src/dmm/core.clj или куда посмотреть?

3. the Self neuron is not mentioned in http://www.cs.brandeis.edu/~bukatin/DataFlowGraphsAsMatrices.pdf. Is that because you introduced it later, together with the two-stroke engine, or is the two-stroke engine a particular implementation detail?

This is me trying to understand how to read all of this, before actually reading it.

[identity profile] datjko.livejournal.com 2017-01-02 11:17 am (UTC)(link)
oh, about 3 - I realized that the two-stroke engine is about inputs and outputs, and not what I had extracted from the mess in my head, namely that on the up stroke we process data and on the down stroke we update the weights with the Self neuron.

[identity profile] anhinga-anhinga.livejournal.com 2017-01-02 06:54 pm (UTC)(link)
Hi!

1. Generalized animation is a rather beautiful thing. Here it is in half a page - section 1.1, page 2 of

http://easychair.org/publications/download/Linear_Models_of_Computation_and_Program_Learning

(If you want to read it at some point: it is a sandwich paper, two papers in one; the mathematical filling, which is Section 4, should be ignored (or treated as a separate paper).)

2. That's the right question; it prompted me to finally write this down and add it to github:

https://github.com/jsa-aerial/DMM/blob/master/design-notes/Early-2017/vector-space-of-mixed-tensors.md

3. We already understood the two-stroke engine back then (at the time we said "bipartite graph", but it served the same purpose - to alternate linear transformations with nonlinear ones, which is exactly the two-stroke cycle). But we did not understand Self yet; we tried to have the matrix change itself at the level of individual elements, and that turned out to be terribly tedious and creaky...

Then we realized that we could introduce Self and work at the level of matrices and large fragments of matrices, and life became somewhat easier...

[identity profile] datjko.livejournal.com 2017-01-02 08:50 pm (UTC)(link)
1. thank you, it's indeed interesting.
" This architecture provides a direct way to incorporate aesthetic criteria into software systems."
" One should be able to formulate mechanisms of higher-order animation programming via variable illumination of elements of such structures."
- Wow, cool! I'm no longer put off by such a generic view in your papers. I'd rather regret that you did not make a series of masterpieces making the concept generate explanations of itself on its own, with a formality/poetry ratio as a parameter :-)

2. Now I see, thank you.
"we are talking about a co-product (direct sum) of one-dimenstional spaces." - hmm. I'd say without this note it reads better. Simply, a linear space generated by finite linear combinations of strings of finite length over some alphabet ("a countable set L which is a subset of legal keys"), right? Or, if we merge common prefixes in the strings in these linear combinations, it's a space of prefix trees with coefficients in leaves (the strings of keys assumed to have an implicit sentinel end-of-string character at their ends which represents the leaf nodes in the trees), right?

[identity profile] anhinga-anhinga.livejournal.com 2017-01-03 12:16 am (UTC)(link)
1. It would certainly be nice to create intelligent masterpieces :-) Not a bad subject for meditation and, with some luck, experiments :-) :-) :-)

2. You are right, thanks! I've commented this note out.

***

Yes, these are prefix trees with coefficients (which are real numbers at the moment, although we might make them something more interesting).

***

We do use a very rich alphabet, where everything which Clojure can use as a hash key is a legal letter. So even the alphabet is countable. Yes, mathematically speaking, it would be enough to use just strings as letters (the alphabet is still countable), and then if we use an extra sentinel end-of-string character, then one could concatenate the strings in the path, and use the resulting finite alphabet, and get prefix trees over finite alphabet, but with more depth.

This would indeed work.

***

Implementation-wise, we are actually in love with intermediate layers, e.g. one can see how we use them in our network matrix here (https://github.com/jsa-aerial/DMM/blob/master/design-notes/recurrent-maps.md and the code) with columns and rows of the network matrix being indexed not by "flat indices", but by a three-level hierarchy of indices. So we might want to not actually concatenate in our implementation, but to keep the alphabet countable and to keep "letters" expressive.

But in general, this "concatenation idea" might be quite fruitful in a number of contexts. (The actual implementation would probably replace a sequence of keys, e.g. key1, key2, key3, with a single key which would be a vector [key1 key2 key3].) Basically, instead of concatenating to increase the tree depth, we would be concatenating to decrease the tree depth (using whole paths as single dictionary keys). So we should keep this possibility in mind; it might come in handy in a number of situations.
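A tiny illustration of the two options (with purely hypothetical keys):

;; Deep nesting: the path is a chain of keys.
(def deep {:key1 {:key2 {:key3 1.0}}})
(get-in deep [:key1 :key2 :key3])   ;; => 1.0

;; Flattened: the whole path becomes a single vector-valued key.
(def flat {[:key1 :key2 :key3] 1.0})
(get flat [:key1 :key2 :key3])      ;; => 1.0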

[identity profile] datjko.livejournal.com 2017-01-03 07:14 am (UTC)(link)
I keep reading in small portions from time to time.

1.
"
let's agree that the first level of hash-map keys would name the respective input and output streams and map to their latest values.
...
Therefore the root keys should be types of neurons.
"

Neuron types or instances of neurons? - OK, I see, instances are the next level.

2.
"
A map from names of the built-in transformers to (the map from names of the individual neurons to (the map from names of their output streams to the respective non-zero matrix coefficients)).
"

In the first quote you talked about naming both input and output streams, and here only about outputs?

3. Am I right that you don't keep in mind the computational graph "as a picture" and you rather think of it as something implicitly given by the "computational matrix"?

Maybe the pictures of usual neural nets confuse me here. It seems I understood the idea. Could you please check whether I'm wrong in the following:

The "computation matrix" *is* that prefix tree of chains of keys that we were talking about (aka recurrent map), and a rule for computing the next step is to traverse it up to each individual neuron and apply the neuron's function to the subtree of arguments, producing a new subtree of output values, which is then split&replicated&wrapped with appropriate neurons keys to make up a computation matrix for next step.

I have a feeling that in the papers this (or some more generic operation?) is meant as part of some simple transformation of a giant (or even countably-dimensional) but sparse matrix or tensor, is that right? Most likely it's explained in the papers you gave links to, but I haven't read or understood it yet. Do you have a simple hint about what that matrix transformation is which would apply the neuron functions to the appropriate data and produce a new computational matrix? Or maybe an example? Tomorrow I'll try to grasp the example of the interview task solution in https://arxiv.org/abs/1606.09470 so don't hurry with answers if you think it's better to wait for me there.

<s>
at any moment of time t (and on the up movement of the two-stroke engine) there is a directed bipartite computational graph g_t with neurons as edges (or terminal source or sink nodes) transforming an input recurrent map vector of values (generated for the moment t) to an output recurrent map vector of values. Nodes of g_t are neurons' inputs and outputs (corresp in input and output partitions of the graph), where several recurrent map streams are merged (concatenated) and split (or replicated).
</s> (I wanted to erase this statement as I thought I came to a better idea above after thinking about recurrent maps vs vectors of values for input/output streams. I left it here just in case it's worth any comments.)

4. What are the requirements for neuron functions? Generically, at least they are assumed to always produce exactly zero on zero input, right? Are there other requirements? Stateless? Deterministic?

[identity profile] anhinga-anhinga.livejournal.com 2017-01-03 03:11 pm (UTC)(link)
Hi!

I'll answer in small pieces...

2. Columns correspond to outputs, and they are indexed by this three-level structure: {name of the built-in transformer {name of the individual neuron {name of the output stream ...}}}.

When one considers a particular matrix row, this corresponds to "the second index, which changes when one moves along the matrix row".

Rows correspond to inputs, and they are indexed by this three-level structure: {name of the built-in transformer {name of the individual neuron {name of the input stream ...}}}.

So the matrix elements are indexed by the 6-level structure: {name of the built-in transformer of neuron-i {name of the individual neuron-i {name of the input stream {name of the built-in transformer of neuron-j {name of the individual neuron-j {name of the output stream ...}}}}}}
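So a single non-zero matrix element sits at the bottom of a nested map shaped like this (all the names here are made up for illustration):

;; One matrix element w = 0.5 connecting the output stream :out of neuron :b
;; (built on transformer :multiply) to the input stream :main of neuron :a
;; (built on transformer :relu).  Row (input) indices first, then column
;; (output) indices.
(def network-matrix
  {:relu {:a {:main {:multiply {:b {:out 0.5}}}}}})

(get-in network-matrix [:relu :a :main :multiply :b :out])
;; => 0.5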

3 (first paragraph). I love visualizing computational graphs. It is always good to have alternative angles of view (to view the network both as a matrix and as a graph).

In some of our earlier prototypes, we did visualize how the dataflow graph of a changing network changes with time. E.g. I have this auxiliary livejournal, and its top post has the video (made in June 2015) which shows the changing dataflow graph in the lower left corner: http://anhinga-drafts.livejournal.com/29929.html

This was made before we understood the matrix formalism. And I always wanted to have the option to edit the dataflow graph directly at the graphical level via input interfaces (being inspired in this by some visual programming languages like Pure Data). But it's always a lot of software engineering work, and we don't have anything like this yet in our current Clojure system.

[identity profile] anhinga-anhinga.livejournal.com 2017-01-03 03:35 pm (UTC)(link)
> The "computation matrix" *is* that prefix tree of chains of keys that we were talking about (aka recurrent map), and a rule for computing the next step is to traverse it up to each individual neuron and apply the neuron's function to the subtree of arguments, producing a new subtree of output values

Yes, this is an "up movement" of the two-stroke engine. In the current version of

https://github.com/jsa-aerial/DMM/blob/master/src/dmm/core.clj

see the comments starting from line 303 at the end of that file; the two functions at the bottom implement the traversal. (When you decide you want to read the code, the important function to understand is reduce; this whole DMM codebase is constructed around it: https://clojuredocs.org/clojure.core/reduce )
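For readers who have not met it before: reduce folds a function over a collection, threading an accumulator through it, e.g.

(reduce + 0 [1 2 3 4])                          ;; => 10
(reduce (fn [acc [k v]] (assoc acc k (inc v)))
        {}
        {:a 1 :b 2})                            ;; => {:a 2, :b 3}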

[identity profile] anhinga-anhinga.livejournal.com 2017-01-03 03:45 pm (UTC)(link)
So this up-movement takes a three-level structure of all inputs, and produces a three-level structure of all outputs: from a dictionary of these things {name-of-the-built-in transformer {name-of-the-individual-neuron {name-of-the-input-stream some-input}}} it builds a new dictionary of these things {name-of-the-built-in transformer {name-of-the-individual-neuron {name-of-the-output-stream some-output}}}

[identity profile] anhinga-anhinga.livejournal.com 2017-01-03 04:17 pm (UTC)(link)
> Do you have a simple hint about what that matrix transformation is which would apply the neuron functions to the appropriate data and produce a new computational matrix? Or maybe an example?

Currently, we use a very simple form of Self. It has two arguments, and on the up movement it simply
adds them together. Its output is connected to one of its own inputs with weight 1, so this allows it to act as an accumulator (to keep the computed value through time).

The other input of Self accepts additive contributions from other neurons. So on the down movement, those additive contributions are added together, and on the up movement they are added to the previously existing matrix.

***

A couple of simple examples are in Appendix D.2 on page 6 of https://arxiv.org/abs/1610.00831

The second of those examples is also implemented here: https://github.com/jsa-aerial/DMM/tree/master/examples/dmm

[identity profile] anhinga-anhinga.livejournal.com 2017-01-03 05:08 pm (UTC)(link)
4 is an interesting question.

On one hand, the requirements depend on what training methods are to be applicable. For example, people occasionally consider the step function (y = 1 if x > 0 else 0), but its derivative is 0 everywhere it is defined, so if one wants to do gradient descent, this is not a good neuron function to have.

At the same time, one can consider an integral of the step function (ReLU: y = max(0,x)). People ignored it for a long time because of the discontinuity in the derivative, but it turned out that it works fine in gradient methods, and that it is actually one of the best neuron functions available.

***

A priori, we don't impose many requirements. When a new neuron is recruited to an active part of the network, it happens because Self just changed the network matrix on the up movement. So the next thing that happens is that the inputs of that newly recruited neuron are computed on the down movement. Hence we don't even need to impose any particular requirements for zero.

In fact, we typically think about input neurons as neurons which ignore their own input, if any, and just inject their output into the network. So this would be an example of a neuron which is not trying to map zeros to zeros.

They are usually stateless, although when we described the general formalism in the second of those preprints, we specifically made sure to be general enough to allow state, if desired.

We found stochastic neurons quite interesting. E.g. all these funky self-organizing patterns in the videos linked from the first comment in http://anhinga-anhinga.livejournal.com/82757.html are obtained using neurons which are "randomized propagators" (they usually pass their input through unchanged, but with a small probability they output zero instead; this turns out to be enough for all the complexity in those videos).
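A hedged sketch of such a "randomized propagator" (the videos use their own implementation; this is just the idea):

;; Pass the input through unchanged, but with a small probability emit zero.
(defn randomized-propagator [p-drop]
  (fn [x] (if (< (rand) p-drop) 0.0 x)))

(def neuron (randomized-propagator 0.05))
(map neuron [1.0 2.0 3.0])
;; => usually (1.0 2.0 3.0), occasionally with a 0.0 in place of a value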

[identity profile] datjko.livejournal.com 2017-01-03 06:23 pm (UTC)(link)
Misha, thank you a lot for so detailed answers! I'll keep making small steps and asking new questions.

[identity profile] anhinga-drafts.livejournal.com 2017-02-04 08:35 am (UTC)(link)
Hi!

This is one of the most interesting talks I've ever heard in my life:

http://anhinga-drafts.livejournal.com/29954.html

[identity profile] datjko.livejournal.com 2017-01-04 11:11 am (UTC)(link)
Misha, they say that "There's no sparse variable in TensorFlow":
https://groups.google.com/a/tensorflow.org/forum/?utm_medium=email&utm_source=footer#!msg/discuss/LXJIWkil_mc/IEWI1398EQAJ

Period. Currently they can't do gradient steps on sparse weights.
I have yet to learn what people do for ML on large graphs, for example. I believe there must be a workaround.

Before you pointed me to sparse tensors in TF, I was thinking of making a kd-tree out of a sparse tensor and applying to the tree an RNN adapted for recurrent processing of trees (instead of the usual recurrent processing of sequences). This way all weights are dense and, arguably, the recurrent tree RNN may mimic plain convolution-like updates or even have some advantage over them. On the other hand, in a kd-tree spatially neighboring cells (leaves) can be quite far from their nearest common ancestor. That is, the tree RNN would have to learn spatial neighboring relations for some nodes separated by quite long traverses through the tree, and that makes me skeptical about this approach.

Currently I have no idea when I will be able to play with such kd-tree RNNs, but if I suddenly learn something inspiring for this technique, I may want to give the idea a try as seriously as I can. Basically I want to make it possible to apply DL to point clouds (which are, generically, sparse samples of whatever, embedded into whatever space).

[identity profile] anhinga-anhinga.livejournal.com 2017-01-04 08:28 pm (UTC)(link)
I'll also try to think about this...

[identity profile] anhinga-anhinga.livejournal.com 2016-12-30 05:57 am (UTC)(link)
(more on 1. Of course, if one has unary neurons with "log" and "exp" activation functions, then multiplication is doable at a relatively moderate cost. Multiplication only becomes difficult if one has the usual (rather impoverished) set of activation functions.

But even with "log" and "exp", there is some difficulty, because the most important case of multiplication is multiplication by zero. So a bit more thought would be needed on handling things near zero. And, of course, sign reversal would be a separate unary operation. Still, the idea of presenting multiplication in this fashion is neat and might be useful for some things, so let it be recorded here.)
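For the record, the positive-inputs case is just this one-liner (ignoring signs and zeros, as discussed above):

;; For a, b > 0:  a * b = exp(log a + log b)
(defn mul-via-log-exp [a b]
  (Math/exp (+ (Math/log a) (Math/log b))))

(mul-via-log-exp 3.0 4.0)   ;; => 12.0 (up to floating-point rounding)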
Edited 2016-12-30 06:57 (UTC)

[identity profile] anhinga-anhinga.livejournal.com 2016-12-30 11:31 pm (UTC)(link)
Collection of my resources, links, and tips related to Clojure.

Reading Clojure code

Two resources are the most important when reading Clojure code:

http://clojure.org/reference/reader

(This page explains the meaning of all special symbols, and search engines are not effective with those.)

The community-based Clojure Docs. Google search would typically take you right there, but from that page you can also search for strange-looking function names which are not handled well by search engines:

https://clojuredocs.org/clojure.core/-%3E

Installation

The most straightforward path is to install Leiningen, a lightweight alternative to Maven:

http://leiningen.org/

Then when you type lein repl, it will download and install the latest stable Clojure (currently 1.8) automatically.

Books

A number of books are available online for free.

"Clojure for the brave and true" is a useful online book. Unlike many nice paper books, it covers some of the newer aspects (such as core.async, available since Clojure 1.6):

http://www.braveclojure.com/core-async/

IDE (traditional style)

Major IDEs tend to offer a choice of plugins for Clojure. The most popular choice for Emacs is CIDER, the most popular choice for JetBrains (IntelliJ) is Cursive, etc.

Some attractive IDEs, e.g. Light Table and Nightcode, were developed from scratch as Clojure IDEs.

One of the attractive choices is the proto-repl plugin for Atom:

https://atom.io/packages/proto-repl and http://blog.element84.com/proto-repl-update.html

IDE (notebook style)

Jupyter notebooks have a selection of Clojure plugins.

An attractive stand-alone IDE for Clojure in the style of Mathematica notebooks is Gorilla REPL. See the two videos linked from the first paragraph of the following page for a short and impressive introduction:

http://gorilla-repl.org/renderer.html

The latest Clojure/conj conference

http://2016.clojure-conj.org/

In particular, I'd like to draw attention to the talk by Jason Gilman on Proto REPL for the Atom editor, which contained a very impressive demo of Proto REPL's capabilities: http://2016.clojure-conj.org/proto-repl/

The video of the talk: http://youtube.com/watch?v=buPPGxOnBnk