High-Dimensional Computing With Sparse Vectors Using Clojure
Table of Contents
- Overview
- Introduction
- Implementation
- The Address Decoder Neuron
- The Elements of High Dimensional Computing
- The Means of Combination
- What Is The Dollar In Mexico?
- The Means of Abstraction
- Prototypes
- Programming In Superposition
- Ambiguity Primitives
- Sparse Distributed Memory
- Conclusion
- Outlook (Notes)
- Appendix
- Literature
My current explorations in hyperdimensional computing. These kind of piled up, so I'm putting them out there now.
Overview
Core implementation
- implemented binary sparse segmented VSA (BSEG-VSA), also called Binary Sparse Block Code (B-SBC) in Clojure,
- using dtype-next (jvm)
- libpython-clj, pytorch and torch.cuda (code: torch).
- this was done in python in a more complete framework here: torchhd
Sparse Distributed Memory (SDM)
- Implemented a Sparse Distributed Memory SDM compatible with B-SBC. Introduction, exploratory code.
- Future: Add a version of this to torchhd
Multi-Value Mode SDM
- Discovered a multi-value mode of SDM that utilizes the B-SBC design (code comment). This is a potential post-transformer design, in light of Attention Approximates Sparse Distributed Memory [Bricken 2021].
K-Fold Memory Implementation
- Implemented a k-fold memory (P. Kanerva 1988), using a delayed address decoder design (code). This might be somewhat clumsy compared to other HDC sequence structures, but is interesting in light of the SDM-Cerebellum mapping.
Biological Implications
- I argue that the delayed address decoder might be implemented by parallel fibers (pf) acting as delay lines in the Cerebellum, making the k-fold memory a design that unifies clock-based [Braitenberg 1967] (assumes delay lines) and learning-based [Albus-Marr Theory] (does not account for the thinness of parallel fibers) theories of the Cerebellum. This promises to increase the biological plausibility of the SDM-Cerebellum correspondence. And it potentially contributes to an engineering, design-based framework of understanding Cerebellum-like structures [review].
- I argue that the address decoder neuron (P. Kanerva 1988) is isomorphic to a special case of a reader-centric cell assembly [G. Buzsáki 2010, 2]. This might help bridge between the neuroscientific concept of the neuronal ensemble (same as cell assembly) and hyperdimensional computing.
- Using this framework to map HDC and biology, the multi-value mode SDM could be implemented by variable excitability, via sheet-inhibition of Purkinje Cells.
- This might correspond to Braitenberg's cores and halos [Braitenberg 1977], and Hebb's fringes [D. Hebb 1949].
Practical Applications, exploring advanced computing paradigms
- Made some small examples using concept blending, programming in superposition 8, and analogical reasoning (Section: Programming In Superposition), exploring the structure-preserving properties of HDC/VSA.
- Wrote down ideas for a language of ambiguity.
Introduction
Now we have a system where we can actually represent structure. We could implement Lisp. And if you think of computational linguistics, 30 years ago, 40 years ago, 50 years ago. It was all Lisp programs. So then came neural nets and this tradition was sort of lost. We can bring this together. With this system, we can bring them together.
Hyper or High Dimensional Computing (HDC) is a neuromorphic computing approach that handles both ambiguity and structure in powerful ways ( 3, 4), and is useful for biologically plausible cognitive modeling1.
I prefer hyper-dimensional because it sounds more like alien tech.
It provides a joyful approach to yet-to-crystallize neuro-symbolic programming techniques.
Its hallmarks are mathematical rigor (butchered here), its cognitive connotations (What Is The Dollar In Mexico?) as well as its biological plausibility. Following the tradition of John von Neumann's The Computer and the Brain, it explores the question: what are the properties of (neuro-)biological computing? This is a somewhat different branch of thinking, most exemplified in the seminal and inspiring Sparse Distributed Memory (Kanerva 1988) 3.
In HDC, information is represented as high-dimensional distributed "holographic" data structures, the hypervector. It is typically a very wide vector (vector length N >= 1000, or even N = 10.000)2 and is closed under the VSA (Vector Symbolic Architecture) operations3.
Advances in the philosophy of programming, biophilosophy and epistemology are required for solving long-standing engineering problems in creating truly dynamic, creative, robust computer systems 4,5,6.
Current connectionist machine learning paradigms lack explainability and reliability (they frequently "hallucinate"), and show a lack of reasoning capability that would be covered by basic world models, such as object permanence7. Challenges for connectionist approaches remain largely untouched [G. Marcus 2003, Minsky and Papert 1969, see 8, F. Chollet9].
Surely, new and old ideas in neuro-symbolic or perhaps alternative approaches like a-life or genetic programming are required for true progress in machine intelligence. It is often said that nobody has figured out yet how to do neuro-symbolic AI (Melanie Mitchell and Gary Marcus are eloquent on this in YouTube podcasts).
HDC/VSA has the potential to fill such gaps.
Clojure is a powerful tool for modern software engineering, prototyping and bottom-up programming. Moreover, it is a Lisp and is intertwined directly with the history of machine intelligence, early computer science and the philosophy of programming.
Carin Meier introduced a Clojure way of doing HDC/VSA, on which I try to build, towards "small AIs for personal enjoyment" (Carin Meier). Her communication on explorative programming and alternative programming paradigms is quite inspiring to me; and it feels like "a Clojure thing to do".
Why Clojure? Because why would you not want a REPL into your AI?
Here, I provide an example implementation of a binary segmented sparse (BSEG) VSA using dtype-next, an excellent high-performance Clojure framework.
Sparsity is a theme of HDC and related fields. Presumably, the brain uses the sparsest encoding it gets away with, because the number of active neurons drives brain energy consumption.
the most important thing in AI right now perhaps is not Consciousness and things like that but how do you make it energy efficient because that is the constraint.
G. Buzsáki10
It is circles because neuronal nets are often depicted with the neurons as circles and lines between them. Squares, to make it look different, and the 'tree' stands for data structures, like trees. The idea is that currently each approach is strong in either ambiguity (relationships of many data points) or structure, but not both.
I often use blue to stand for data and green to stand for process. This is used here for flair.
- Picking a random hypervector (a seed vector) is akin to a gensym.
- Hyperspace is virtually infinite for practical purposes.
- Hypervectors can be used to model semantic spaces.
- HDC/VSA can encode structures like frames, (associative data structures), and sequences.
- HDC/VSA can be used to compute analogy-like reasoning What Is The Dollar In Mexico?, a single example of template-like reasoning (1), while being transparent.
- HDC/VSA can be seen as a theory of programming for emerging stochastic (unreliable, low-voltage) computing technologies like memristors and quantum computing, and its most distinguishing feature might be "computing in superposition"8.
- Bloom filters are shown to be a subclass of HDC/VSA11. With a bloom filter, all elements are tested for set membership in a single step, in parallel.
- HDC/VSA can be seen as 'programming with embeddings', if one is machine learning-ly inclined.
- Yet, using VSA we can represent bindings (key-value pairs), sequences, graphs and so forth; Programming In Superposition
- Beyond the "set-like", unordered, membership/similarity quality of both bloom filters and embeddings.
- All operations return new hyper vectors of the same length; composite data structures are hypervectors, too.
Implementation
I followed High-Dimensional Computing with Sparse Vectors [Laiho, Poikonen, Kanerva, Lehtonen 2020].
I used Chris Nuernberger's dtype-next.
The Address Decoder Neuron
It should also be noted that the message system used in the nervous system, […] is of an essentially statistical character. In other words, what matters are not the precise positions of definite markers and digits, but the statistical characteristics of their occurrence, i.e. frequencies of periodic or nearly periodic pulse-trains, etc. Thus the nervous system appears to be using a radically different system of notation from the ones we are familiar with in ordinary arithmetic and mathematics: instead of the precise systems of markers where the position - and presence or absence - of every marker counts decisively in determining the meaning of the message, we have here a system of notations in which the meaning is conveyed by the statistical properties of the message. […] This leads to a lower level of arithmetical precision but to a higher level of logical reliability: a deterioration in arithmetics has been traded for an improvement in logic.
John von Neumann, The Computer And The Brain, 1957
In Sparse Distributed Memory, Pentti Kanerva provides an alternative view on artificial neurons, the address decoder.
Suppose that the neuron A has 1.000 inputs (N = 1.000). In the book, the address a (a point in H) is made from -1's and 1's.
The 'optimal' input to the neuron is the address. If the inputs are equal to the address, the sum of the weighted inputs is s = N = 1000.
This is true because the -1 inputs contribute -1 * -1 = 1 and the +1 inputs contribute 1 * 1 = 1, so sum(weighted-inputs) is N = 1000.
We see that a threshold can be expressed as the distance d (the Hamming distance, that is the count of bits that are different) between the neuron's address and the inputs.
For instance, if the threshold is N - 1 = 999, then the distance d = 1.
And 1 bit of the input is allowed to be different from the neuron's address for the neuron to still fire - a wiggle room of 1.
The response region of the neuron A is a circle in high-dimensional space, around the address (a).
This region includes a very small number of the points of the space H, until at around d = N/2 it suddenly includes most of the points.
These properties of H are one of the main interesting aspects of HDC, and they are beautifully explained in 1,3, and Kanerva talks.
If d = N, then threshold = 0 and it includes the whole space. Then A fires for any input.
Here, I use binary sparse vectors, and roughly the same thing works for them too:
Say the threshold is theta = 3; then the inputs and weights need to have at least 3 bits of overlap for the reader to fire.
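To make the threshold idea concrete, here is a minimal sketch in plain Clojure (sets of 'on' bit positions stand in for sparse binary vectors; the names are hypothetical, this is not the dtype-next implementation):

(require '[clojure.set :as set])

;; the overlap is the count of 'on' bits shared by the address and the inputs
(defn overlap [address inputs]
  (count (set/intersection address inputs)))

;; the neuron fires when the overlap reaches the threshold theta
(defn decoder-fires? [address inputs theta]
  (<= theta (overlap address inputs)))

(decoder-fires? #{3 120 250 733} #{3 120 250 901} 3)
;; => true, 3 bits overlap; one mismatching bit is within the wiggle room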
This perspective makes it easy to see how neurons can deal with distributed representations.
Distributed data structures allow us to mix information. Half of a data point is meaningful so mixing half and half of two is also meaningful. This opens the door for programming with superposition, where ambiguity and multiple things true are expressible.
If the input vector is large (1.000 <= N, here N = 10.000), we say it is a high dimensional or hyper-dimensional distributed representation. It is a distributed data element because the bit locations contribute equally to the information content. You can see this is true via the logic of the sum in the mechanism. The output of the sum doesn't have information about locations anymore - all overlap bits contribute equally. This is different from say 32-bit architecture, where the bit location conveys meaning.
Similar concepts are sparse distributed representations (SDR) (if sparse and N ~= 2.000), holographic reduced representations (HRR)3, spatter code (if binary), and parallel distributed processing (PDP) (1 reviews these). "Cell Assemblies" and "distributed representations" share conceptual history with HDC: Hebb (1949), Marr (1969), Willshaw (1981), Palm (1980), Hinton, McClelland, and Rumelhart (1986), Kanerva (1988), and others. (I took this list from Rachkovskij and Kussul 2000.)
In machine learning, almost the same thing goes under the name of distributed representations or word embeddings (like word2Vec). Yoshua Bengio recently went as far as speculating that something like word embeddings is the mechanism for qualia9.
I think there is a very close correspondence between the address decoder and the reader-centric cell assembly of Buzsáki12:
- They share a downstream reader.
- They fire together in a time window given by the integration time ("tau"13) of the reader.
Additionally:
- A time window is required for a physiological implementation. (In other words, what counts as synchronous, or co-activation is defined by the integration time of the reader).
- We see that the address decoder was the simplest case where the time window is equal to the neuron time step time. (I.e. all neurons have to fire together during 1 time step).
- The time window could be increased, perhaps dynamically - as is the case in the brain via neuromodulators etc14.
This is a bit of an aside, but
- A reader is in principle allowed to be any downstream effect, as long as there is some use to something, somewhere.
- Imagine a domino-falling cognition machine, as Hofstadter asks us (Hofstadter 2007). Say every domino falling gives off a small amount of heat. If a domino falls without causing another domino to fall we might label it reader-less, hence user-less and useless. You anticipate where this goes.
- Not so fast! If the heat has effects in the system, for instance, if dominos fall faster with higher temperatures in the machine, or any variation of this - then even such a seemingly lonely domino did something to the system.
- This suggests at least 2 classes of effects for the firing of neurons in a brain: firstly, the ones that cause reader neurons to fire, and secondly, the ones that I would label substrate effects15.
But for both of these, I think we should ultimately be able to conceptualize in terms of an address decoder or assembly. It is just that some assemblies have different kinds of time windows and so forth.
That would be something like a 'universal' cell assembly hypothesis; Saying that cell assemblies don't actually need reader neurons to have effects; They have any kind of reader mechanism or downstream effect. And it is still useful to analyse the assembly.
- Because of this, I don't say address decoder neuron but address decoder.
- The reader can take into account the activity of a "whole" (distributed), which makes this a robust encoding. Single elements are allowed to be missing or out of sync etc.
- An active data structure, as we might conceptualize neuronal ensembles, must have a reader, or some downstream effect. In other words, all meaning is relational.
- Also, different readers can have different perspectives on the same data.
The important thing is if this something is used by the brain. If you don't have an impact on something, it doesn't matter.
G. Buzsáki10
See Neuronal Ensembles for the neuroscience cell assembly concept. I highly recommend G. Buzsáki The Brain From Inside Out on this topic.
A very similar concept is the population vector [Georgopoulos 198616], which is foundational to brain-computer interface technology.
The Elements of High Dimensional Computing
The hypervectors (abbreviated hv, hd, hdv; hdvs for plural) are high-dimensional vectors, typically 10.000 elements long. It is sufficient for them to be binary; they still exhibit interesting high-dimensional properties.
The Binary Segmented Sparse HV
;; -------------------
;; 1. Elements - The Hypervector
;; -------------------

(defn ->hv
  "Returns a fresh, random hypervector - the element of VSA.

  __Binary Segmented Sparse__ :

   <------ N bits wide, typically 10.000 --->
  +-----------+-----------+------------+------------------+
  | _ _ _ _ 1 | _ 1 _ ..  | 1          |                  |
  +-----------+-----------+------------+------------------+
    s           s           s          ...   segment-count (~100)
  <----------->
   segment-length (~100)

  Each segment has 1 (random) bit non-zero.
  In order to make it sparse, we only set 1 bit per segment.

  For operation, see:
  [[thin]], [[maximally-sparse?]], [[similarity]], [[superposition]], [[bind]], [[unbind]]"
  ())
This is also called a Sparse Block-Code17,18 because non-zero bits are only allowed to go into certain positions. (1 per block).
I usually draw a hypervector as a colored-in circle and denote them with single lowercase letters (a, b, c, x, …).
Three hypervectors, or hypersymbols if you like:
György Buzsáki was asked:
If it is true that we have preexisting patterns that we later map onto something, that means that there are only a limited number of things we can learn.
Exactly. But those limited patterns are limitless in your |< __ >| … 100 years of life.
Kanerva (1) is describing the high dimensional space H. For any given point a, almost all the other points have roughly the same distance to a (quasi-orthogonal).
In hyperspace, all the points are so far apart that you can plug a new point again and again, and you will never get 2 that resemble each other; That's the property that György Buzsáki sees for computing with many neurons19, too.
We can pick a random20 hyper vector a (high dimensional vector, HDV) from H. Also called a seed-vector. These are the elements of high-dimensional computing.
In HDC we deal with high-dimensional points and their relationships (similarity) to each other.
We usually need a cleanup memory (Auto Associate Memory) to recover a seed back from a noisy hv.
- HDV creation doesn't always have to be random seeds, one can choose to project or embed from some world domain
- You might also choose to encode scalars for instance via some scheme, so that 9 is similar to 10 and so forth (see the sketch after this list).
- Instead of random seeds, one can plug from a pool of pre-allocated seeds, which are already stored in some form of memory (like Buzsáki's hippocampal trajectories).
- This solves half of the memory problem. Even though the space of possible 'seeds' is reduced; The math is in our favor. The blessing of dimensionality quickly gives us enough data points to go for the rest of the sun's lifetime.
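One possible scalar encoding, sketched in plain Clojure (this is an assumption for illustration, not part of the implementation here): a maximally sparse hv is represented as one 'on' index per segment, and a value in [0,1] linearly interpolates between two random endpoint seeds, so nearby scalars share most of their segments.

;; a hypothetical level/interpolation encoding sketch
(defn ->indices
  "A maximally sparse hv as a vector of one random 'on' index per segment."
  [segment-count segment-length]
  (vec (repeatedly segment-count #(rand-int segment-length))))

(defn encode-scalar
  "Take the first (v * segment-count) segments from `hi`, the rest from `lo`."
  [lo hi v]
  (let [n (count lo)
        k (Math/round (* v (double n)))]
    (vec (concat (take k hi) (drop k lo)))))

(defn overlap-similarity
  "Fraction of segments whose 'on' index agrees."
  [a b]
  (/ (count (filter true? (map = a b))) (double (count a))))

(let [lo (->indices 100 100)
      hi (->indices 100 100)]
  [(overlap-similarity (encode-scalar lo hi 0.9) (encode-scalar lo hi 1.0))
   (overlap-similarity (encode-scalar lo hi 0.1) (encode-scalar lo hi 1.0))])
;; => [0.9 0.1] (approximately)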
Similarity
A simple distance measure suffices for HDC. For instance, the Hamming Distance counts the bits that disagree
a: [0 0 0 1 0 ] b: [0 0 1 0 0 ]
hamming distance = 2
Consider that for sparse vectors, the hamming distance is dominated by all the 0's agreeing. The above pair has a distance of 'only' 2, yet is as far apart as possible (k = 5, count-non-zero = 1).
So I take the overlap of non-zero bits and normalize with the sparsity.
(apply max
       (let [a (hd/->hv)]
         (for [n (range 50000)]
           (hd/similarity a (hd/->hv)))))
0.07
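For intuition, a minimal sketch of this overlap-based similarity on plain Clojure vectors (hypothetical helper names; the real hd/similarity works on dtype-next tensors): count the positions where both vectors have an 'on' bit and divide by the segment count, i.e. by the number of 'on' bits of a maximally sparse vector.

(defn nonzero-overlap
  "Count of positions where both vectors are 'on'."
  [a b]
  (count (filter (fn [[x y]] (and (pos? x) (pos? y))) (map vector a b))))

(defn similarity-sketch [segment-count a b]
  (/ (nonzero-overlap a b) (double segment-count)))

;; two segments of length 3; one of the two segments agrees
(similarity-sketch 2 [0 1 0 0 0 1] [0 1 0 0 1 0])
;; => 0.5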
Noise Tolerance
The similarity of 2 random hypervectors is guaranteed to be tiny.
Here, 0.1 <= similarity(a,b) means that a and b are related.
One can also imagine trying to spoof an address decoder neuron, say N = 1.000 and bits are {1,-1} binary and the address vector is dense.
If the address is a and the current input ensemble is the unrelated b, the distance between a and b is roughly d(a,b) = N / 2 = 500.
Let's say the neuron threshold (distance to address) is d = 400 = N * 0.4; then you would have to spoof d(a,b) - d = 100 input bits in the right direction towards address a.
Say you are a stochastic process like unreliable hardware, for instance, random skip rate and random intrinsic firing rate of neurons.
Say that you have a coin-flip chance to flip a bit; then for the address neuron to misfire for input b, you would have to throw a coin 100 times correctly, i.e. 0.5^100, which is astronomically unlikely.
(Math/pow (/ 1 2) 100)
7.888609052210118E-31
In other words, if one picks points from H randomly, there is only a tiny region around a for which the address decoder neuron responds. Even if there is a 400-bit (d = N * 0.4) wiggle room for the neuron to be spoofed.
It is quite remarkable that 40% of the bits are allowed to be random in the input, and the address decoder can reliably differentiate noise from its address. P. Kanerva: 1,4,3.
This is impressive already with N = 1.000, but here N = 10.000.
This holds to a lesser extent for sparse vectors, but the properties of sparse HDVs are still similarly impressive (Z. Yang for a more mathematical summary)21.
One of the deep mysteries of neuroscience is that there is a 40% skip rate for neuron firing 5. This is called failure rate in the literature. But perhaps it is a feature, not a bug.
Perhaps it is the amount of skipping the brain gets away with, maximizing energy efficiency.
Or perhaps this roughly makes the address inputs hover right around the detector threshold. Intuitively, this is like rolling the ball on the top of an attractor landscape in a recurrent neuronal net, making it less stuck in any basin.
I am imagining a piano, where the notes are the neuronal ensembles, a good piano will be fluid, and all keys (notes) are easy to compose.
—
Hearing the voice of a person, say 'Bob', always sounds like the same voice, even though the particulars of the sensor data must be quite variable. In Hofstadter's terms, we make an analogy between the instances of hearing Bob's voice. We understand/categorize/make the analogy / analyze the situation / find the essence of the 'hearing Bob's voice situation'. Whatever represents such an internal situation is quite robust to noise. See Biological Remarks On The Robustness of The Address Decoder (appendix)
Auto Associate Memory
auto-associative memory: A memory where the address is the same as the content. This is akin to associating a pattern with itself, allowing a pattern to complete. That is retrieving a pattern from a noisy version of the pattern.
If we query with a-prime, which is different from, but similar to, a, we get a out, with the noise cleaned up.
A cleanup memory component of an HDC/VSA allows us to recover the initial hyper vector (seed vector a) from a noisy a-prime.
Hetero Associative Memory
A hetero-associative memory is a memory where address and content are different things. For instance in RAM, the address locations are designated pointers, usually numbers in the word length of the computer (32bit, 64bit,…). The content of the address location is an unrelated string of bits. I.e. all possible bit string permutations go together with any location address. Address and location are different, hence hetero associative.
In HDC/VSA a hetero-associative item memory is needed to map from the hyperdimensional domain to the symbolic domain and back. (Usually, one would remember all seed vectors one creates. Then the item memory can be queried and provides the nearest neighbor among the seed vectors.)
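A sketch of the item memory idea (remember every seed under its symbol; cleanup is nearest-neighbor lookup by similarity). The names here are invented, and hd/->hv and hd/similarity are assumed to be the operations described above; the actual code base has its own item memory.

(def item-memory (atom {}))

(defn remember!
  "Create a fresh seed for `sym` and remember it."
  [sym]
  (let [v (hd/->hv)]
    (swap! item-memory assoc sym v)
    v))

(defn cleanup-sketch
  "Return the known symbol whose seed is most similar to `hv`, if above threshold.
  (Assumes the item memory is non-empty.)"
  [hv threshold]
  (let [[sym sim] (apply max-key second
                         (for [[sym v] @item-memory] [sym (hd/similarity hv v)]))]
    (when (<= threshold sim) sym)))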
Ambiguity
There is one other thing you can do with a single hyper vector. That is dropping random bits and thereby getting a 'weaker' version of your data.
(defn weaken
  "Returns a weakened vector by dropping segment bits from `a`.

  This allows one to express the notion of a mix of hypervectors,
  where they contribute to different amount.

  Weakening the effect of `b` in a superposition:

  (let [a (->hv)
        b (->hv)]
    [(similarity a (thin (superposition a (weaken b 0.5))))
     (similarity a (thin (superposition a b)))])
  [0.76 0.58]

  Mathematically, this is like drawing a line from a to b in hyperspace.
  Bundle finds a point exactly between a and b.
  distance(c,b) ≅ 0.5 ≅ distance(c,a).

  With the weaken operation, you can say how far to move from a to b.
  Like moving on the line between the points.

  This is deterministic."
  [a drop-ratio])
Instead of dropping randomly, here I let the bit indices decide which bits to drop. So this is a kind of Context Dependent Thinning.22
(def a (->hv))

(similarity a (weaken a 0.5))
0.5
This makes it possible to say things like a little bit of, roughly, mostly… See Ambiguity Primitives.
The Means of Combination
HDC has 3 primitive arithmetic operations: superposition, which is akin to + or creating a set; bind, which is akin to * or creating a key-value pair; and permute, a third operation sharing some properties with * but working on a single input vector.
They are all closed under the hyper vector datatype. A vector of the same dimensionality comes out. (That is what reduced in Holographic Reduced Representations, HRR, 3 means).
Superposition (Bundle)
(superposition a b) returns a vector c that is similar to both a and b.
This is typically an elementwise addition, optionally followed by normalization and tie-breaking.
Superposition is commutative and associative. This is also called a sumset operation because it returns a set of the inputs. Just like with a set in programming, it is a data structure representing all the elements without ordering.
Superposition is the same operation that creates a set.
Other names: bundle, possibly, mix, set, union.
Thinning
In this implementation, we might want to thin after a superposition, to preserve the sparsity of our vectors.
Thinning can be done by selecting one random 'on' bit per segment.
Or, one can use the indices to decide, making it a deterministic operation.
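For intuition, a sketch of superposition and thinning on plain Clojure vectors (a hv here is a flat vector of 0/1; the thinning rule simply keeps the first 'on' bit per segment, as a stand-in for the random or index-based tie-breaking described above; none of this is the dtype-next implementation):

(defn superposition-sketch
  "Elementwise OR for binary vectors."
  [a b]
  (mapv max a b))

(defn thin-sketch
  "Keep at most one 'on' bit per segment (here: the first one)."
  [segment-length v]
  (vec (mapcat (fn [segment]
                 (let [i (first (keep-indexed (fn [j x] (when (pos? x) j)) segment))]
                   (map-indexed (fn [j _] (if (= j i) 1 0)) segment)))
               (partition segment-length v))))

(thin-sketch 3 (superposition-sketch [0 1 0 0 0 1] [0 0 1 0 0 1]))
;; => [0 1 0 0 0 1], back to one 'on' bit per segment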
Bind
The bind of A and B returns a vector C that is dissimilar to both A and B.
C can be used to release either A or B, given the other.
It can be used to represent a symbol-value binding.
- Bind distributes over addition
(require '[bennischwerdtner.hd.binary-sparse-segmented :as hd])

(let [a (hd/->seed)
      b (hd/->seed)
      c (hd/->seed)]
  (= (hd/bind c (hd/bundle a b))
     (hd/bundle (hd/bind a c) (hd/bind b c))))
true
- Bind preserves distance
;; 'preserves similarity' in this context means that points that are similar before the transformation
;; are also similar after the transformation.
(let [a (hd/->seed)
      b (hd/->seed)
      between-a-and-b (hd/thin (hd/bundle a b))
      c (hd/->seed)]
  [(hd/similarity a b)
   (hd/similarity a between-a-and-b)
   (hd/similarity b between-a-and-b)
   (= (hd/similarity (hd/bind a c) (hd/bind b c))
      (hd/similarity a b))
   (= (hd/similarity (hd/bind a c) (hd/bind between-a-and-b c))
      (hd/similarity a between-a-and-b))])
[0.0 0.3 0.7 true true]

;; To put this into other terms, C is mapping from one domain to another domain, preserving relationships
- Depending on the implementation, bind is commutative and/or associative.
Here, bind is commutative and associative (for maximally sparse vectors):
(= (hd/bind (clj->vsa :a) (clj->vsa :b))
   (hd/bind (clj->vsa :b) (clj->vsa :a)))
true

(= (hd/bind (clj->vsa :c) (hd/bind (clj->vsa :a) (clj->vsa :b)))
   (hd/bind (clj->vsa :b) (hd/bind (clj->vsa :c) (clj->vsa :a))))
true
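A sketch of one way such a sparse block code can bind, following the idea in the Laiho et al. paper cited above: blockwise circular convolution, which for maximally sparse vectors reduces to adding the 'on' indices per segment, modulo the segment length. The names and representation (one 'on' index per segment) are assumptions for illustration, not the repository's implementation.

(defn bind-sketch [segment-length a b]
  (mapv #(mod (+ %1 %2) segment-length) a b))

(defn unbind-sketch [segment-length c a]
  (mapv #(mod (- %1 %2) segment-length) c a))

(let [a [3 7 0]
      b [99 1 42]]
  [(= (bind-sketch 100 a b) (bind-sketch 100 b a))      ;; commutative
   (= (unbind-sketch 100 (bind-sketch 100 a b) a) b)])  ;; releases b, given a
;; => [true true]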
Permute
- Like bind, distributes over addition.
- Like bind, preserves distance.
- Distributes over bind.
Here, we permute by rotating the segments blockwise.
(def a (hd/->seed))
(def b (hd/thin (hd/bundle (hd/weaken a 0.25) (hd/->seed))))

;; permuting creates a vector dissimilar to 'a'
(hd/similarity (hd/permute a) a)
0.0

;; permute-inverse is the inverse of permute
(= (hd/permute-inverse (hd/permute a)) a)
true

;; permute-inverse is simply permuting with negative
;; n. It just means rotate the other way
(= (hd/permute-n (hd/permute-n a 2) -2) a)
true

;; permute preserves distance
(hd/similarity a b)
0.37
(hd/similarity (hd/permute a) (hd/permute b))
0.37

;; permute distributes over addition
(= (hd/permute (hd/bundle a b))
   (hd/bundle (hd/permute a) (hd/permute b)))
true

;; permute distributes over bind
(= (hd/permute (hd/bind a b))
   (hd/bind (hd/permute a) (hd/permute b)))
true
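For intuition, a sketch of blockwise rotation on a plain vector of segments (hypothetical names; the implementation's permute-n rotates the segments in a similar spirit):

(defn permute-sketch
  "Rotate the segments of `v` (a vector of segments) by n positions."
  [v n]
  (let [k (count v)
        n (mod n k)]
    (vec (concat (subvec v (- k n)) (subvec v 0 (- k n))))))

(def v [[:s0] [:s1] [:s2] [:s3]])

(permute-sketch v 1)
;; => [[:s3] [:s0] [:s1] [:s2]]

(permute-sketch (permute-sketch v 1) -1)
;; => [[:s0] [:s1] [:s2] [:s3]], rotating back is the inverse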
- Also called rotate, protect - unprotect
- You can think of it as quote - unquote, since permute can be interpreted as a quotation of a symbol.
- We 'map into an unrelated domain'
- Permute, like bind, is said to be randomizing, because the output does not resemble the input.
- It is interesting to note that permute is trivially implemented in circuits, just make wires that represent the permutation between an input and output register.
The humble permute turns out to be a powerful, simple component for building complex datastructures. See Programming In Superposition.
What Is The Dollar In Mexico?
The classic example works:
(def mexico-record
  (h/thin (h/bundle (h/bind (symbol->hv :capital) (symbol->hv 'mxc))
                    (h/bind (symbol->hv :currency) (symbol->hv 'peso))
                    (h/bind (symbol->hv :name) (symbol->hv 'mex)))))

(def usa-record
  (h/thin (h/bundle (h/bind (symbol->hv :capital) (symbol->hv 'wdc))
                    (h/bind (symbol->hv :currency) (symbol->hv 'dollar))
                    (h/bind (symbol->hv :name) (symbol->hv 'usa)))))

(let [result (h/unbind mexico-record
                       ;; this represents the query
                       (h/unbind usa-record (symbol->hv 'dollar)))]
  (cleanup-lookup-value result))
;; => peso
Observe that we effectively unbind with the query over a distributed, superimposed mapping. The deep reason is that bind distributes over superposition, the outcome is equivalent to the superposition of unbinding each key-value pair. In an HDC/VSA "All data structures are hyper vectors and can be manipulated immediately and in parallel, regardless of how complicated a structure they possess" 8.
While this example is coherent on its own terms, the real-world factuality of this depends on the perspective of the reader.
I'm guessing one of my readers is right when they said:
I think "peso" would be the answer to a question close to: "what is the common name for medium of exchange that's backed by the government of Mexico"
The Means of Abstraction
WIP.
Prototypes
- When the bind is commutative, the notion of key and value is not fixed.
(cleanup-lookup-value (h/unbind usa-record (symbol->hv 'dollar)))
:currency
- We can ask what is dollar, and get currency out, just as well.
This suggests that we can also simply use the first instance of encountering something, like encountering dollar, and use that as a proto-symbol, or a prototype.
Thenceforth, the symbol dollar could represent the role currency.
So for Mexico, we would then say dollar<->peso, using 'dollar' as symbol for 'currency'.
Hofstadter and Sander in Surfaces and Essences 23 describe how children acquire word meanings; for instance, the symbol mum would undergo a maturation process, where at first it refers to a single person, and eventually refers to the role mum in a mum-child relationship (other people have mums, grandmums are mums of mums).
If I remember correctly, David Deutsch (or was it Hofstadter and Sander?) talked about something similar. The fact that Galileo discovered a Jovian moon is relevant because now we refer to moons in the plural. The notion moon grew into referring to a role: the earth-moon is not singular in some essential way; there are many earth-moon-like, relationship-having planetary objects.
Programming In Superposition
Because bind and permute distribute over superposition, the VSA operations generally do the same to a superposition as they would do to the individual elements.
A single bind of superpositions makes a cross product:
;; this is the crossproduct of the 2 sets
(hd/bind (hdd/clj->vsa* #{:a :b :c})
         (hdd/clj->vsa* #{:x :y :z}))

;; contains all key-value combinations
(hdd/cleanup* (hd/unbind (hd/bind (hdd/clj->vsa* #{:a :b :c})
                                  (hdd/clj->vsa* #{:x :y :z}))
                         (hdd/clj->vsa* :y)))
'(:b :a :c)
I followed some of 8 and implemented some complex data structures.
For explanation, I can recommend 8. Fun with trees is more of a walkthrough.
Note that I came to the same Clojure->VSA vocabulary as Carin Meier:
Clj {}: Becomes the superposition of the bound key-value pairs (equivalent to the undirected graph).
Clj #{}: Becomes the set.
(I skipped interpreting [] atm).
Principles (So far):
- Everything is a set; complex data structures overlap if they share structure.
- Intersection, difference, union work for complex datastructures generally.
- Computing with a set (a superposition) is like computing with an element, and the rest comes along for the ride, in parallel.
Basics
To go from the symbolic domain to the hyper domain, clj->vsa* creates a new seed vector, remembers it in an item memory and returns it.
cleanup* serves as an item memory lookup, retrieving the closest match (with a threshold). I.e. hyper domain to symbolic domain.
(hdd/clj->vsa* :a)
#tech.v3.tensor<int8>[10000]
[0 0 0 ... 0 0 0]

(hdd/cleanup* *1)
'(:a)
We might as well interpret a Clojure set in the symbolic domain, returning a set, which is the same as the superposition.
(= (hdd/clj->vsa* #{:a :b :c})
   (hdd/set (hdd/clj->vsa* :a)
            (hdd/clj->vsa* :b)
            (hdd/clj->vsa* :c))
   (hd/superposition (hdd/clj->vsa* :a)
                     (hdd/clj->vsa* :b)
                     (hdd/clj->vsa* :c)))
true
Set comes with intersection and difference:
(def sun (hdd/clj->vsa* #{:yellow :warm :space-object :round}))
(def banana (hdd/clj->vsa* #{:yellow :food :tasty :fruit :banana-shaped}))

(hdd/cleanup* (hdd/intersection sun banana))
'(:yellow)

(hdd/cleanup* (hdd/difference (hdd/clj->vsa* #{:ice :cold :hard :water})
                              (hdd/clj->vsa* #{:cold :hard :stony :earth})))
'(:ice :water)
The simple associative map is the superposition of its key-value pairs (created by bind):
(= (hdd/clj->vsa* {:a :x :b :y :c :z})
   (hd/superposition (hd/bind (hdd/clj->vsa* :a) (hdd/clj->vsa* :x))
                     (hd/bind (hdd/clj->vsa* :b) (hdd/clj->vsa* :y))
                     (hd/bind (hdd/clj->vsa* :c) (hdd/clj->vsa* :z))))
true
The cool thing is that you treat the set as if it were an element:
(hdd/cleanup* (hd/unbind (hdd/clj->vsa* {:a :x :b :y :c :z})
                         (hdd/clj->vsa* :a)))
(:x)
Unbind releases a value from a key-value pair; it also releases a value from the superposition of key-value pairs.
In programming in superposition, we generally treat the whole as if it were an element. Pretend the forest is a tree, so to say. 👈
Querying a map with the result of the intersection of something:
;; -------------------
(def fruit-domain
  (hdd/clj->vsa* {:lemon :yellow :apple :red :orange :orange}))

(hdd/cleanup* (hd/unbind fruit-domain (hdd/intersection sun banana)))
'(:lemon)
What Is The Moon of Saturn?
- "What is the dollar in mexico?" kind of thing, but with a directed graph and the values are a superposition
(def earth (hdd/->directed-edge (hdd/clj->vsa* [:moon :luna])))

(def saturn
  (hdd/directed-graph (hdd/clj->vsa* [[:adjective :saturnian]
                                      [:adjective :cronian]
                                      [:adjective :kronian]
                                      [:rings true]
                                      [:moon :mimas]
                                      [:moon :enceladus]
                                      [:moon :tethys]
                                      [:moon :dione]
                                      [:moon :rhea]
                                      [:moon :titan]
                                      [:moon :iapetus]])))

;; let's say you know luna and you want to know what that is in saturn domain
(hdd/cleanup* (hdd/edge->destination saturn
                                     (hdd/edge->source earth (hdd/clj->vsa :luna))))
'(:rhea :iapetus :tethys :dione :mimas :titan :enceladus)

;; 0. "What is the dollar in mexico?" kind of things work in general.
;; 1. the moon of saturn is a superposition of 7 things
;; 2. Saturn is a composite datastructure, yet we pretend it's one element. 'edge->destination'
;;    works on an edge element, and on the superposition of elements
Nondeterministic Finite-State Automaton, Bending Water
;; --------------------
;; Nondeterministic finite-state automaton
;; --------------------
;; - it can be in several states at once
;; - there can be several valid transitions from a given current state and input symbol
;; - It can assume a so-called generalized state,
;;   defined as a set of the automaton's states that are simultaneously active
;; - a generalized state corresponds to a hypervector representing the set of the currently active states
;; - query the same way, is like executing the automaton in parallel (in superposition)
;; - cleanup will have to search for several nearest neighbors
;;
;; automaton in superposition (i.e. just query with states that are in superposition)
;;
(def water-domain
  (apply finite-state-automaton
         (clj->vsa* [[:frozen :heat :liquid]
                     [:liquid :heat :gas]
                     [:liquid :cool :frozen]
                     [:gas :cool :liquid]
                     [:gas :heat :gas]
                     [:frozen :cool :frozen]])))

(cleanup* (automaton-destination water-domain
                                 (hd/superposition (clj->vsa :liquid) (clj->vsa :frozen))
                                 (clj->vsa :cool)))
'(:frozen)

;; if your state is the superposition of liquid and frozen
(cleanup* (automaton-destination water-domain
                                 (hd/superposition (clj->vsa :liquid) (clj->vsa :frozen))
                                 (clj->vsa :heat)))
'(:liquid :gas)

;------------------
(def water-bender-domain
  (apply finite-state-automaton
         (map #(map clj->vsa %)
              [[:frozen :heat :shards]
               [:liquid :heat :bubbles]
               [:liquid :cool :absolute-zero]])))

;; now I have 2 automatons,
(cleanup* (automaton-destination
           ;; ... superimpose them
           (hd/superposition water-domain water-bender-domain)
           (hd/superposition (clj->vsa :liquid) (clj->vsa :frozen))
           (clj->vsa :heat)))
'(:liquid :gas :shards :bubbles)
;; and we just run them in parallel, lol
;; stuff like that.
Superimposing 2 automatons (union) is equivalent to making 1 large one in the first place. It is somewhat suggestive that the primitives of a hyper interpreter might have this 'mixing' at the core.
More of this at fun with trees.
What Is the Bread For Lava?
;; ----------------- ;; ;; Pretend for a moment you have a cognitive system that already established certain 'samenesses' ;; (def liquid (hdd/clj->vsa :liquid)) ;; forgot about the honey ;; (def honey (hd/thin (hd/superposition liquid (hdd/clj->vsa :honey)))) (def butter (hd/thin (hd/superposition liquid (hdd/clj->vsa :butter)))) (def lava (hd/thin (hd/superposition liquid (hdd/clj->vsa :lava)))) ;; ------------------------------------------------------- ;; Model stuff with finite state automatons ;; ;; Say that the outcome of spread is a surface filled with *something* ;; ;; (def bread-domain (hdd/finite-state-automaton-1 (hdd/clj->vsa* [[:bread :spread {:surface butter}] [:bread :spread {:surface :vegan-spread}] [:bread :crumble {:crumps :bread}] [:bread :forget {:bread :molded}]]))) (def lava-domain (hdd/finite-state-automaton-1 (hdd/clj->vsa* [[:rocks :spread {:surface lava}] [:lava :freeze :rocks] [:vulcano :erupt {:spew lava}] [:rocks :forget {:rocks :ancient}]]))) ;; what is the bread for lava? ;; first ask what is the action I would do with lava and bread, ;; the similarity of lava and liquid doing work now, the shared :spread comes out: (hdd/cleanup* (hd/unbind bread-domain (hd/bind (hdd/clj->vsa* :bread) (hd/permute (hdd/clj->vsa* {:surface lava}))))) '(:spread) ;; the essential mechanism for this is at fun_with_trees.clj (let [the-action-that-would-lead-to-lava-surface-given-a-bread (hd/unbind bread-domain (hd/bind (hdd/clj->vsa* :bread) (hd/permute (hdd/clj->vsa* {:surface lava}))))] [:spread-lava-in-bread-domain (hdd/cleanup* (hdd/automaton-source bread-domain the-action-that-would-lead-to-lava-surface-given-a-bread (hdd/clj->vsa* {:surface lava}))) :spread-lava-in-bread+lava-domain (hdd/cleanup* (hdd/automaton-source (hdd/union bread-domain lava-domain) ;; for this to work, I need need to cleanup with ;; an item memory ;; (literature mentions this as challange. ;; Resonator networks can do this efficiently) ;; -------------------------------------------- ;; cleanup to prestine :spread (hdd/clj->vsa (hdd/cleanup the-action-that-would-lead-to-lava-surface-given-a-bread)) (hdd/clj->vsa* {:surface lava})))]) [:spread-lava-in-bread-domain '(:bread) :spread-lava-in-bread+lava-domain '(:rocks :bread)]
This example leverages the similarity (the overlap) between lava and butter.
Also, you can query a union of finite state automatons and get the mix of automaton outcomes.
Observe that all other concepts like vulcano, erupt don't interfere with the query.
This is because we query in the surface domain. Since bind is randomizing, surface ⊙ lava is dissimilar to spew ⊙ lava.
HDC/VSA expresses ambiguity, similarity and structure at the same time in interesting and joyful ways. 👈
Ambiguity Primitives
The Cortex is an information mixing machine - V. Braitenberg
A HDC software framework might or might not come with ambiguity primitives (just ideas).
- everything - similar to all other hdvs.
- nothing - similar to no other hdvs.
- mix - returns a new point that is similar to the inputs.
- vanishingly - drops bits from input, making it less similar to input, allowing mixing.
- mostly - ~(mix input-a (vanishingly input-b))
And so forth.
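A sketch of how some of these could be expressed with the operations above (hd/weaken, hd/superposition and hd/thin are the functions described earlier; the drop ratio and the helper names are arbitrary assumptions):

(defn vanishingly
  "Drop most of a vector's bits, so it only faintly contributes to a mix."
  [a]
  (hd/weaken a 0.75))

(defn mix
  "A new point similar to both inputs."
  [a b]
  (hd/thin (hd/superposition a b)))

(defn mostly
  "Mostly a, with a faint contribution of b."
  [a b]
  (mix a (vanishingly b)))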
So far I have leveraged this only for the lava example.
Sparse Distributed Memory
Sparse Distributed Memory, [3] (SDM) is a conceptually biologically plausible random access memory. I highly recommend P. Kanerva for an explanation. Here, the explanation is extremely rudimentary.
From Douglas Hofstadter's foreword:
This book, because of its pristine mathematical beauty and its great clarity, excites me deeply when I read it, and I believe it will have the same effect on many others… Pentti Kanerva's memory model was a revelation for me. It was the very first piece of research I had ever run across that made me feel I could glimpse the distant goal of understanding how the brain works as a whole
I call it the 'alien spaceship' effect that comes from 'pristine mathematical beauty and great clarity'. It promises to be great building material for software, and it's a great pleasure for me to indulge in exploring such intuitions.
SDM maps onto the Marr-Albus theory of cerebellar function, which is said to be the most convincing case of computational neuroscience informing the structure and function of a brain structure24. (Although one might have to say that it is perhaps the only halfway convincing example of computational neuroscience saying anything about brain structure and function.)
The Problem: Given a high-dimensional point a (for address), reading from the storage should return the stored point d (for data).
A sufficiently similar point a-prime should still return d.
For RAM in current computers, each address location has a number assigned to it, called the address, usually a number with the bit-width of the computer's word (32bit, 64bit, …). To write, the address decoder activates the corresponding address location (selected by the address register), and the circuit writes the content of the input register into that location. To read, activate the address, and the circuit writes the location's content to the output register.
If the addresses are hyperdimensional (i.e. N >= 1000), having 1 address per word is not feasible.
Consider approaching the address decoder with a hypervector a; it should give you a unique address location. This would mean it needs a location for all 2^N addresses of your hyperspace H.
Quoting Kanerva 4: Even 2^100 would be too many to fit into the human brain, as 2^100 molecules of water would more than fill it (the number of neurons in the nervous system is "only" about 2^36).
The key insight is to use the similarity to multiple addresses (making this distributed) instead. Thus, the address decoder activates a set of address locations (called hard locations).
The second part of the design is to say that there are very few hard locations compared to the address space H, making this a sparse memory.
Asking an address decoder in SDM for a location given an address a returns you a set of similar address locations.
Each address location points to a string of bit counters (storage) that are as wide as your input and output (content) word (content matrix below). Each small rectangle here is ~ a hyper vector.
Instead of writing and reading from a single storage location, the user writes and reads from an ephemeral location, that is made from the ensemble (if you will) of the activated locations, which "cluster" around your address "A". (In the original SDM, this cluster is inside a hypersphere of the chosen Hamming distance between your address A and the hard locations).
If the similarity ('distance', i.e. the radius of the hypersphere) and the address location count (for instance M = 1e6) are chosen well, then it is very likely that for a given address A', which is similar to A, you get a similar location ensemble (my words) from the address decoder.
Say you write pink-content using address A.
Say you wrote cyan-content into the ephemeral location B, then when reading with 'A'-prime:
For reading, we can sum the content in the hard locations and use a majority rule.
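To make the write/read mechanism concrete, here is a toy SDM sketch in plain Clojure (dense binary words, tiny dimensions, invented names like write! and read-word; the actual implementation described below is sparse, segmented and tensor-based, this only shows the overlap-activation and majority-rule mechanism):

(defn rand-bits [n] (vec (repeatedly n #(rand-int 2))))

(def N 256)    ;; word length
(def M 1000)   ;; number of hard locations
(def hard-addresses (vec (repeatedly M #(rand-bits N))))
(def counters (atom (vec (repeat M (vec (repeat N 0))))))

(defn agreement
  "Number of bit positions where two dense binary words agree."
  [a b]
  (count (filter true? (map = a b))))

(defn activated
  "Indices of hard locations whose agreement with `address` is >= threshold."
  [address threshold]
  (filter #(<= threshold (agreement address (hard-addresses %))) (range M)))

(defn write!
  "Increment counters where the word bit is 1, decrement where it is 0,
  in every activated hard location."
  [address word threshold]
  (doseq [i (activated address threshold)]
    (swap! counters update i
           (fn [row] (mapv (fn [c bit] (if (pos? bit) (inc c) (dec c))) row word)))))

(defn read-word
  "Sum the counters of the activated locations and take the majority (sign)."
  [address threshold]
  (let [rows (map @counters (activated address threshold))]
    (if (empty? rows)
      (vec (repeat N 0))
      (mapv #(if (pos? %) 1 0) (apply mapv + rows)))))

;; with N = 256, random agreement is ~128 +- 8, so a threshold of 148 activates
;; only the handful of hard locations that happen to lie close to the address
(let [a (rand-bits N)
      d (rand-bits N)]
  (write! a d 148)
  (= d (read-word a 148)))
;; => true (with high probability)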
Implementation - Design:
Reprinted code comment:
;; Adapted from Kanerva 1993: ;; And mapping to neuroanatomy of cerebellum ;; ;; ;; ;; ADDRESS REGISTER WORD IN REGISTER ;; ;; N = 10.000 U = N = 10.000 ;; +--------------+ +------------------+ ;; x | | w | 1 |1 | 1 | | (sparse, segmented) ;; +------+-------+ +----------+-------+ ;; | | ;; | | ;; | N d y | U ;; +------v-------+ +-+ +-+ +----------v-------+ ;; | | |3| |1+----> | ;; | | |0| |0| | | ;; | A +--> |2| ---------> |1|----> C | ;; | M hard addr. | |0| threshold |0| | M x U counters | ;; M | | |1| |0| M | | ;; +--------------+ +-+ +-+ +---------+--------+ ;; | ;; | | ;; | | v ;; | | +-----------------+ ;; | | S | | sums ;; v | +--+-----+----+---+ ;; address overlap | | | | ;; ~d of [1,2] | | | | ;; | v v v (where S >= read-threshold) ;; activations <----+ --+-----+----+---- ;; ( 2 <= d) v v v top-k per segment ;; +-----+----+------+ (read-top-k) ;; z | | | | ;; +-----+----+------+ ;; s1 s2, ... segment-count ;; ;; ;; word out register ;; ;; 10.000 bits = segment-count * segment-length ;; ;; ;; ;; x - address word 'mossy fiber input' ;; A - Address Matrix 'mossy fibers -> granule cells synapses' ;; M - hard-locations-count 'granule cell count' ;; y - address-activations 'granule cell activations' ;; w - input word 'climbing fiber input' ;; C - Content Matrix 'parallel fibers -> purkinje synapses' ;; S - content sums 'purkinje inputs' ;; z - output word 'purkinje activations', or downstream purkinje reader ;;
For a Cerebellum mapping see Cerebellum Schema.
Address Decoder
Its input is an address word x of length N, here (N = 10.000).
We match against the address-matrix, made from M (M ~ 1e6) hard address locations.
These are preconfigured vectors of length N. In the original, they are made from bits where each is {1, -1} with probability 1:1.
But here, they are sparse {0,1}, and we look at the overlap with x, resulting in an overlap (count) vector d.
The sparseness of the address matrix makes this an example of an 'intermediate design' SDM7,25.
We cut off d at the decoder-threshold, yielding the address-activations, representing the hard locations that are on.
Observe that a lower decoder-threshold yields a larger number of activated addresses. There is an optimal count of activated address locations (the activation probability), which is important but not discussed here for brevity.
Content
Write:
Increment the counters in the content matrix C, capped at counter-max, at all the overlap positions of the activated address locations and the input word (length U).
If U equals N, this can be used as an auto-associative memory.
Read:
Given the address location activations, sum up the counters of C, selecting the activated rows.
In the original SDM design, take the sign (majority) of the resulting sums and call it the output word.
Here, I added steps to get a sparse segmented binary hyper vector out. This takes a top-k for each segment.
If top-k is equal to 1, the outcome is usually maximally sparse - i.e. there is one 'on' bit per segment.
If top-k is > 1, then a sumset of outcomes is returned, allowing the user to query for multiple continuations of their query at the same time.
See What Is The Meaning of read-top-k > 1?.
In an auto-associative usage mode, the user decodes with address a, then writes the content a into the activated locations.
The user can query with a' to clean up back to a.
converged lookup
This can be done iteratively, where each subsequent a'' is even more similar to a.
This is a very surprising thing about the SDM. The initial required similarity is called critical distance 4.
This allows one to rapidly probe the memory, saying whether an item is stored, or not. If the iterative outcomes converge (very few steps like 2-3 here), the chance to converge on an actual stored item is astronomically high.
The cognitive connotation is akin to a tip of the tongue situation, where the user is sure they must "know", even though the remembering doesn't work. Conversely, one can extremely quickly and I would say with utter conviction say whether one has any chance of remembering something. One "knows" one knows or knows that one "ought to know" something.
Implementation Concrete
Using libpython-clj and pytorch, I was able to build a personal SDM at the Clojure Repl.
This utilizes the pytorch sparse tensor feature, which allows us to have a large number of hard address locations (M), because the address-matrix is allowed to be extremely sparse. And the content-matrix is small for small datasets (T < 10.000).
(M = 1e6, N = 1e4) is handled without a big GPU. (GeForce GTX 1050 Ti, only 4096MiB RAM).
This serves as an example of "neural modeling" using pytorch and libpython-clj.
Code is here: sdm.clj (quite raw, interfaces not figured out and so forth, ad-hoc datatransfers python-cuda-jvm etc.).
What Is The Meaning of read-top-k > 1?
- Querying with top-k > 1 results in a set of "possible values" for the query.
- The mechanism for this is what I call unmasking secondary or latent storage counters, which then contribute to the outcome.
- This is similar to an excitability iceberg [Yuste 2022]. In this case, top-k represents roughly the excitability of purkinje cells.
- This is akin to querying for the halo26 of a neuronal ensemble.
- Roughly: Modifying the address decoder threshold increases the radius of the hypersphere; modifying the query top-k is saying how many centers to query for.
- Put another way, the top-k parameter is like the water we fill into the attractor landscape, which is bounded by the address location.
- Increasing top-k to segment count is like filling the water so the complete landscape is covered, where values in between give intermediate 'amounts of contributing outcomes'.
- I.e. querying with top-k = 3 returns a concept cloud of 3 related hypervectors. Querying with top-k = 1 is akin to querying for an ensemble center.
- In Hebbs terms that is fringes and cores. In Braitenberg terms that is halo and centers.
- In light of Attention Approximates Sparse Distributed Memory, this suggests a possible way of updating the transformer architecture; it would correspond to having multiple values per attention key.
- highly speculative idea: Interestingly, this would give an interpretation for quantized inhibition, in that case the output words of Cerebellum would be discrete superpositions of an elementary neuronal letter/word.
(def m (sparse-sdm {:address-count (long 1e5) :address-density 0.00003 :word-length (long 1e4)})) (def d (hd/->seed)) (def d-related (hd/thin (hd/superposition (hd/->seed) d))) (do (write m d d 1) (write m d-related d-related 1)) ;; top-k = 1 I get something out that is ca 50% d (hd/similarity d-related (torch->jvm (:result (lookup m d 1 1)))) 0.75 ;; and 100% d (hd/similarity d (torch->jvm (:result (lookup m d 1 1)))) 1.0 ;; with top-k = 2 you get both of them out, ;; querying with either of them ;; ;; intuitively, this is because they ;; share 50% of their address locations. So ;; those bit counters are high, but not the ;; highest. with top-k > 1, we *unmask* the ;; relatively high ;; ("secondary") bit counters, this returns us ;; the superposition of *related* addresses ;; (hd/similarity d (torch->jvm (:result (lookup m d 2 1)))) 1.0 (hd/similarity d-related (torch->jvm (:result (lookup m d 2 1)))) 1.0 (hd/similarity d-related (torch->jvm (:result (lookup m d-related 2 1)))) 1.0 ;; to sanity check, a random seed (not stored) ;; will not result in anything resembling ;; anything else, even if top-k is 'high' (let [e (hd/->seed)] (hd/similarity e (torch->jvm (:result (lookup m e 10 1))))) 0.05 ;; but top-k == segment-length actually results ;; in 'everything' (this is akin to epilepsy at the purkinje cells) (= (let [e (hd/->seed)] (hd/similarity e (torch->jvm (:result (lookup m e 500 1))))) 1.0) true (= (let [e (hd/->seed)] (dtt/->tensor (torch->jvm (:result (lookup m e 500 1))) :datatype :int8)) (hd/->ones)) true
Unmasking, roughly:
Divergence In Hetero Associative Mode
This is more interesting in a hetero associative usage mode.
(do (alter-var-root #'hd/default-opts #(merge % (let [dimensions (long 1e4) segment-count 20] {:bsdc-seg/N dimensions :bsdc-seg/segment-count segment-count :bsdc-seg/segment-length (/ dimensions segment-count)}))) (def m (sparse-sdm {:address-count (long 1e5) :address-density 0.00003 :word-length (long 1e4)})) (def d (hd/->seed)) (def d-related (hd/thin (hd/superposition (hd/->seed) d))) ;; heteroassociative: ;; d->a ;; d-related->b (def a (hd/->seed)) (def b (hd/->seed)) (do (write m d a 1) (write m d-related b 1)) ;; sanity check, the address content should be different from the address (hd/similarity d (torch->jvm (:result (lookup m d 1 1)))) 0.0 ;; query with ;; top-k = 1 [:a (hd/similarity a (torch->jvm (:result (lookup m d 1 1)))) :b (hd/similarity b (torch->jvm (:result (lookup m d 1 1))))] [:a 1.0 :b 0.0] ;; top-k = 2 [:a (hd/similarity a (torch->jvm (:result (lookup m d 2 1)))) :b (hd/similarity b (torch->jvm (:result (lookup m d 2 1))))] [:a 1.0 :b 1.0] ;; ... look I get the superposition of both a and b, ;; even though a and b are unrelated, they are possible continuations of `d` ;; but `b` is only unmasked secondarily, in a more 'divergent mode' of querying. )
- Cognitive connotations: It might help model "possibility clouds", or "concept halos", reminiscent of cognitive fluidity [Mitchell, Hofstadter27].
- Biological Remarks
- Such a "query density" surely would be supported by some of the inhibitory interneurons of Cerebellum.
- Lowering the excitability of the Purkinje is akin to querying with low top-k. Increasing the Purkinje excitability is akin to querying with a higher top-k.
- As noted, on one extreme, this would be epilepsy at the Purkinje cells. On the other extreme, it would be too much inhibition of Purkinje cells.
- Fast conducting parallel fibers might implement a fast feedforward 'sheet' inhibition (selecting subsets of purkinje, i.e. something that looks like the top-k here), it would make sense if the inhibition of Purkinje cells is proportional to granule cell activation.
- Biological Remarks, The Address Matrix
- The Cerebellum sdm mapping would predict that the address matrix is static.
- The granule cells should not have synaptic plasticity, you might even expect processes that keep the mossy-fiber->granule cell synapses stable.
- The granule cells would be allowed to not fire for the most part, yet not be pruned etc.
It turns out that this is wrong! Mossy fiber -> granule cells make LTP.
Sequence Memory - First Order
—
The code comments here, while raw in some ways, reflect a complete thought process of designing a k-fold sequence memory. Arguably, what follows here is a terser version of the code.
—
A simple way to encode a sequence [a,b,c,d]
in an SDM is to encode each pair a->b, b->c and so forth as an address-content pair.
So we use the memory in a heteroassociative mode, where the outcome of query 1 is in turn the address for query 2, and so forth.
;; ;; a +--> b +-> c ;; +-----+ | +-----+ | ;; | |--+ | |---+ ... ;; +-----+ +-----+ ;; ;; k = 0 , k = 1 ;; ;;
- This is akin to a pointer-chain or linked-list.
- This breaks down the moment we share an element x across multiple sequences in the dataset.
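A toy illustration of the pointer-chain idea and where it breaks (a plain Clojure map stands in for the heteroassociative memory; hypervectors and SDM reads would replace the keywords and map lookups):

(defn store-pairs
  "Store each consecutive pair of `xs` as an address->content entry."
  [m xs]
  (into m (map vec (partition 2 1 xs))))

(def memory (store-pairs {} [:a :b :c :d]))

(take 4 (iterate memory :a))
;; => (:a :b :c :d), following the pointer chain

;; the breakdown: a second sequence sharing :c overwrites (in an SDM, mixes with) c->d
(def memory-2 (store-pairs memory [:x :y :c :q]))

(take 4 (iterate memory-2 :a))
;; => (:a :b :c :q), first-order transitions cannot disambiguate after :c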
Sequence Memory - Higher Order Via K-fold
- Parallel fibers are slow conducting, this would overlap with the need for a delayed address decoder in a k-fold memory.
- The Cerebellum as a kind of clock [e.g. Braitenberg 28] is the second main "cerebellum theory", but has fallen out of favor recently.
See design and implementation.
I reproduce the code comment here:
;; Version 3: ;; ;; K fold memory ;; ;; - In [2] Kanerva is building up the principle with a vocubulary, it is reproduced here, not completely but mostly ;; and with somewhat different notation that fits the context here: ;; ;; (T is the dataset, like everything the machine encounters). ;; ;; - kth-Order Transition: 'A sequence occuring in xs(T) is a subsequence of xt(T) of length k + 1 ;; For example [a,b,c,d] is a third-order transition; it says that the three-element sequence [a,b,c] is followed ;; by d.' ;; ;; - j-Step Transition: ;; A pair of elements in xs(T) separated by j - 1 elements. For example [a,d] is a three-step transition, ;; it says that a is followed by d. ;; A first-order transition (here, the 'pointer chain' Version 1) is a one-step transition. ;; We already see that j-step transitions are easy to store in random access memory. ;; Store `d` with `a` as address. ;; ;; - j-Step Memory: ;; A (sparse distributed) memory sdm-j[j] that stores the j-step transitions of xs(T). ;; The data retrieved from reading with `x` can be denoted data-j[j,x], ;; or `(lookup-xs sdm-j[j] x)` ;; It is further assumed that the data become avalable j time steps after the address x has been presented to the memory, ;; so that the memory has a built-in delay. 👈 ;; ;; - k-fold Memory: ;; A k-fold memory is a set of k j-step memores {sdm-j[1], sdm-j[2], ..., sdm-j[k]}. ;; The jth memory, sdm-j[j] is also referred to as the jth fold. ;; ;; - k-fold Data: ;; The k-fold data at time T, D(T), are the multiset of the data available at time T. ;; ;; - k-fold prediction: ;; To summarize (vaguely): ;; ;; Consider having read `a` from a one-fold memory: ;; ;; [ a, ? ], ;; ;; the data at hand will be `b'`, and the prediction input came solely from the addresses read with `a` at time t - 1. ;; ;; In a k-fold prediction, reading at ;; ;; [a,b,c,d,?], ;; ;; The addresses read t - 1, t - 2, t - 3, t - 4 all contribute to the prediction of `?` ;; If T includes sequences: ;; ;; [a,b,c,d,e] xs1 ;; [j,k,l,d,f] xs2 ;; ;; Then, the k-prediction will have roughly ;; ;; 1/k contributions from the xs2 for `f`. ;; ;; 1/k * 4 contributions from xs1 for `e`, ;; ;; The data will include overwhelmingly more of `e`, than `f`, so `e` is the 4-fold prediction for [a,b,c,d,?] ;; ;; ;; --------------------------------- ;; The delayed address decoder: ;; --------------------------------- ;; ;; 1. split all addresses into `k-delay` buckets ;; So that each address has an associated `delay`. (0 < `delay` <= `k-delays`) ;; (this can be done by delaying the input to the address decoder, or its output lines, ;; which would fit the neurophysiology of the parallel fibers - they are slow conducting). ;; ;; ;; ;; decoding: ;; ;; When decoding address `a` at time `t0`, ;; activate address locations `a0` at time `t0`, ;; then `a1` at time `t1` and so forth. ;; ;; ;; ;; a ;; +----------- delay lines (~ slow conducting parallel fibers?) ;; | | ;; | | ;; v v ;; +----------+ +-+ +---+ +---+ ;; | ----- | ----+-+-----| |--------------> | | a1 ;; | | |1| | | | | ;; | ----- | ---+-+--> | | a0 | | ;; | | |0| | | | | ;; +----------+ | | | | | | ;; address decoder | | +---+ +---+ ;; ^ | | | ;; | | | | ;; | +-+ | ;; addresses delays v ;; ;; ;; activations ;; {a0} {a1} adress-location activation set ;; ;; ;; t0 t1 ;; ;; ;; 2. When decoding, the resulting activation set is the union of active locations, ;; Including the ones activated (1...k) steps in the past. 
;; t0: decode with `a`
;;
;;    [ASCII diagram in the source: address decoder A reads `a`;
;;     activation set (now): { a0 }
;;     future state: [ a1, a2, a3, ...] ]
;;
;;    At each time step, some a-lines will be on in the future.
;;
;; t1: the a-lines are still on for a while; decode with `b`
;;
;;    [ASCII diagram in the source:
;;     activation set (now): { a1, b0 }
;;     future state: [ a2, a3, ...] and [ b1, b2, ...] ]
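To make the delayed-decoder idea concrete, here is a minimal sketch in plain Clojure (sets and maps only; the location ids, the delays and the address->locations table are made up for illustration, and this is not the dtype-next implementation from the repo). Each hard location gets a fixed delay; decoding an address schedules its locations into future time steps, and the activation set at time t is whatever is scheduled for t.

;; A toy delayed address decoder.
;; Each hard location has a fixed delay bucket.
(def location->delay
  {:l0 0, :l1 1, :l2 2, :l3 0, :l4 1})

;; Pretend decoder output: which hard locations an address activates.
;; (In the real design this comes from the threshold address decoder.)
(def address->locations
  {:a #{:l0 :l1 :l2}
   :b #{:l3 :l4}})

(defn schedule
  "Decode `address` at time `t`: schedule each of its locations at t + delay.
   Returns an updated map of time -> set of active locations."
  [timeline address t]
  (reduce (fn [tl loc]
            (update tl (+ t (location->delay loc)) (fnil conj #{}) loc))
          timeline
          (address->locations address)))

(defn activation-set
  "The locations active now: everything scheduled for time `t`."
  [timeline t]
  (get timeline t #{}))

;; Decode `a` at t=0 and `b` at t=1, as in the diagram above:
(def timeline (-> {} (schedule :a 0) (schedule :b 1)))

(activation-set timeline 0) ;; => #{:l0}        (~ {a0})
(activation-set timeline 1) ;; => #{:l1 :l3}    (~ {a1, b0})
(activation-set timeline 2) ;; => #{:l2 :l4}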
- The key insight is that a delayed address decoder allows us to smear out the query across multiple pieces of a sequence.
- Both sequences and auto-associative iterative lookups are shown to converge, if the initial address a' is sufficiently similar, i.e. inside the critical distance from the stored data [3].
- A reset can be achieved by querying with nothing, or with nonsense, for ~7 steps (a small sketch of the iterative readout follows below).
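A minimal sketch of such an iterative readout loop, assuming hypothetical lookup (one SDM read) and similar? (a hypervector-overlap test) functions; neither is the repo's actual API:

;; Iterative auto-associative readout (sketch).
(defn converge
  "Repeatedly feed the read result back in as the next address.
   Returns the fixed point, or nil if it doesn't settle within max-steps
   (i.e. the initial address was outside the critical distance)."
  [lookup similar? address max-steps]
  (loop [x address, step 0]
    (let [x' (lookup x)]
      (cond
        (similar? x x')     x'
        (>= step max-steps) nil
        :else               (recur x' (inc step))))))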
Future
- Figure out how to make hierarchical sequences.
- Perhaps a music analogy will help. Perhaps one encodes nested sequences in terms of something like rhythms. (Like Buzsáki suggests 2)
Cerebellum Schema
[Based on Wikipedia] (just as a reference):
- The count of Granule cells is a bit of a meme, along the lines of "The brain has 30 billion neurons and 50 billion of those are granule cells".
- About 3/4 of the brain's neurons are cerebellar granule cells. (Out of the window goes the idea that having many neurons is what makes the neocortex smart.)
- Input to granule cells comes from mossy fibers.
- Mossy fiber inputs come roughly from the rest of the brain for present purposes.
- Granule cells outnumber mossy fibers 200 to 1.
- Granule cells make four to five dendrites (so each granule cell receives input from 4-5 mossy fibers).
- Each granule cell makes one parallel fiber axon that makes a T shape and goes straight through the molecular layer.
- The molecular layer is made from orderly slabs of Purkinje cell dendrites.
- Purkinje cells make a very impressive tree-shaped but slab-like dendrite, and the molecular layer is mostly made out of those.
- Each parallel fiber contacts roughly one in 3-5 of the Purkinje dendrites it passes, making a total of 80-100 synaptic connections.
- Each Purkinje dendrite receives ~200,000 granule cell inputs.
- The sole output of the Cerebellar Cortex is the Purkinje axons, which happen to be inhibitory, making synapses on deep cerebellar nuclei
- (roughly: Deep cerebellar nuclei go to Thalamus then Cortex).
- The proposed module of the cerebellum is a strip of 1000 Purkinje cells,
- receiving from ~ 100 million parallel fibers.
- outputting to a small cluster of ~ 50 neurons in the deep cerebellar nuclei.
- The parallel fibers are slow conducting, unmyelinated. They are among the thinnest known vertebrate axons.
- And 'represent an extreme anatomical adaptation'.
- There are 2 main ideas for this: 1) the pf's are adapted to be compact, so that the number of pf->Purkinje contacts is optimized, and/or 2) the pf's are adapted to make delay lines [Braitenberg 1978], which would mean that the cerebellar cortex is computing something in time.
- (not a timeless calculation, functional programmers will appreciate).
- The second input to the cerebellar cortex comes from the inferior olive, which makes one climbing fiber per Purkinje cell (settles during development).
- The climbing fiber for the Purkinje branches and curls around its dendrite, making a multitude of synapses, it makes the strongest excitatory synapses in the brain.
- Clearly, there is a second input line, which has a 1:1 mapping to the output lines (output lines are Purkinje axons, second input lines are climbing fibers).
- This 1:1 correspondence was what allowed Kanerva to map Cerebellum to SDM. He understood that this was required for memory (input and output 1:1).
—
To map SDM and Cerebellum, imagine that the content matrix is the molecular layer, and that the parallel fiber activations are y.
Observe that in read mode a Purkinje dendrite can sum up its inputs from its pf-synapses (~ S).
Then, per a given "segment" of Purkinje cells, some can be active (output word).
In write mode, the climbing fiber can activate a set of Purkinje cells (input word). Purkinje synapses can be put into a "transient receptive mode" (I made up this word, but I am alluding to complex spikes, see literature), where all of the input modifies the strength of the synapses, akin to updating the bit counters. This happens to be negative in Cerebellum, but Purkinje is also inhibitory, so I suppose somewhere 2 negatives cancel out in this particular arrangement.
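To make the read/write mapping concrete, here is a toy of the classic dense-binary SDM (Kanerva 1988), not the sparse block-code version from the repo, and all sizes are made up. Reading sums the counters of the active rows and thresholds (~ a Purkinje "segment" summing its pf inputs); writing updates the counters of the active rows (~ the climbing-fiber-gated update).

;; Toy classic SDM (dense binary vectors, integer counters).
(defn rand-bits [n] (vec (repeatedly n #(rand-int 2))))

(def word-length 16)
(def num-locations 8)

;; Hard-location addresses (the rows of the address matrix).
(def hard-addresses (vec (repeatedly num-locations #(rand-bits word-length))))

;; Content matrix: one row of counters per hard location (~ molecular layer).
(def content (atom (vec (repeat num-locations (vec (repeat word-length 0))))))

(defn hamming [a b] (count (filter true? (map not= a b))))

(defn active-rows
  "y: the hard locations whose address lies within `radius` of x."
  [x radius]
  (keep-indexed (fn [i a] (when (<= (hamming a x) radius) i)) hard-addresses))

(defn write! [x word radius]
  (doseq [row (active-rows x radius)]
    (swap! content update row
           (fn [counters]
             (mapv (fn [c bit] (if (= 1 bit) (inc c) (dec c))) counters word)))))

(defn read-word [x radius]
  (when-let [rows (seq (active-rows x radius))]
    (let [sums (apply mapv + (map @content rows))]    ;; column sums, ~ S
      (mapv (fn [s] (if (pos? s) 1 0)) sums))))       ;; threshold -> output word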
Conclusion
I argued that the Address Decoder Neuron is a special case of a reader-centric cell assembly [Buzsáki12].
I made a sparse distributed memory (SDM) 3, extending the design to accommodate BSEG hypervectors. Using a BSEG-VSA allowed me to uncover a "divergent" query mode for this design, where multiple values are returned given a query. In short, the read operation allows a top-k query instead of a majority rule, underlining that, in general, sparsity makes working with superpositions simple.
I implemented a k-fold sequence memory via a delayed address decoder, and I suggest that the slow-conducting parallel fibers of the cerebellum might be a biological implementation of a delayed address decoder. Thus, I would argue that a k-fold sequence memory view of the Cerebellum meets the high bar of biological plausibility set by SDM, and explains why the parallel fibers are slow conducting, which was an open question. I believe this could be a way of unifying clock-centric and learning-centric Cerebellum theories.
Outlook (Notes)
- Check out resonator networks.
- Be inspired by neuroscience [e.g. 2, 5], and explore conceptually what a neuronal interpreter, neural syntax, neural word could be.
- The basic idea of the sequence processor Hyperlisp V0 was to program with something like multiverse and collapse.
- In Lisp, the 'first' element is the verb, but in HDC/VSA-Lisp, any part of the composite data structure could be a verb, or multiple verbs (superposition). (also Neuronal Syntax).
- We know from data-oriented programming [Sharvit 2022] 29 that a software framework can leverage the properties of its data structures - it doesn't necessarily have to be an interpreter.
- It would be interesting to work out the mathematical relationship between Population Vectors and address decoders.
- modeling effector neurons as address decoders.
- Implement a 'slipnet' [Mitchell and Hofstadter], attempt to solve the copycat domain, an analogical reasoning challenge.
- Explore ways of combining HDC with cellular automata, neuronal ensembles, chemical computing.
- For instance, one might map an hv into a neuronal area like this. This could serve as an 'active focus'. Using Kanerva's 3 brain design.
Update:
- Blog post already in the making for 2 months.
- In the meantime I started using this stuff a bit more: The analogy story arc.
- I guess that is for further blog posts.
Current state:
"What is the ABC That Starts With J?"
(into [] (map hdd/cleanup (what-is-the-abc-that-starts-with-j [:a :b :c] :j)))
[:j :k :l]
(into [] (map hdd/cleanup (what-is-the-abc-that-starts-with-j [:c :b :a] :z)))
[:z :y :x]
(into [] (map hdd/cleanup (what-is-the-abc-that-starts-with-j [:c :d :c] :j)))
[:j :k :j]
(into [] (map hdd/cleanup (what-is-the-abc-that-starts-with-j [:j :k :j] :o)))
[:o :p :o]
(into [] (map hdd/cleanup (what-is-the-abc-that-starts-with-j [:j :k] :o)))
[:o :p]
(into [] (map hdd/cleanup (what-is-the-abc-that-starts-with-j [:c :b :a] :z)))
[:z :y :x]
Tiny "If abc got changed to abd, what happenend to jkl"
(tiny-if-abc-got-changed-to-abd-then-what-happenend-to-jkl [:a :b :c] [:a :b :d] [:j :k :l])
'(:j :k :m)
(tiny-if-abc-got-changed-to-abd-then-what-happenend-to-jkl [:b :b :b] [:a :b :c] [:j :j :j])
'(:i :j :k)
Appendix
Clojure
Clojure is a highly regarded, functional Lisp hosted on the JVM. It combines the strengths of classical, symbolic, interactive programming in the tradition of Lisp and Scheme with the full power of homoiconicity, extensibility via macros, and "having the language always there".
It inherits minimalism and an emphasis on functional programming from Scheme, while being practical through full integration with its host platforms, making it possible to have a Lisp for virtually every modern use case thinkable.
It improves on Common Lisp by being implemented in terms of abstractions, with powerful abstraction primitives30, not the concrete cons cell.
And by providing a powerful set of default, immutable, persistent data structures, which are deeply utilized in every aspect of Clojure programming, including the code itself, which is expanded to use vectors and maps next to lists. (Clojure code contains [] and {} for binding and destructuring.)
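A trivial example of [] and {} doing this work directly in binding positions:

(defn greet
  [{:keys [name city]}]                 ;; {} destructures the argument map
  (let [[hello who] ["hello" name]]     ;; [] destructures a vector in let
    (str hello ", " who " from " city)))

(greet {:name "Ada" :city "London"})
;; => "hello, Ada from London"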
Data-oriented programming [Sharvit 2022] 29, centering on Clojure's immutable data structures, can be seen as a continuation31 of dynamic Lisp programming with functional style [e.g. Graham 199332].
Biological Remarks On The Robustness of The Address Decoder
The compression of dimensionality from address to address decoder is reminiscent of the pointers in semantic pointer architecture [Eliasmith]1.
In G. Buzsáki 2, a very similar thing is hippocampal trajectories, which are preallocated, rich hub neuronal ensemble sequences of the hippocampus. They are more like snippets of music, put together from a pool of notes.
Roughly, many cognition theories include something like "Low dimensional", low detail, abstract, essences, situations [Hofstadter], categories [Feldman-Barret], analogies, frames [Minsky], prototypes [Kanerva 1], or trajectories [Buzsáki], which must be filled with high dimensional, high detail, concrete "data". Called fillers or terminals at 33.
Note that "low dimensional" is in quotes, because it is a relative statement.
Hearing the voice of a person, say 'Bob', always sounds like the same voice, even though the particulars of the sensor data must be quite variable. In Hofstadter's terms, we make an analogy between the instances of hearing Bob's voice. We understand/categorize/make the analogy / analyze the situation / find the essence of the 'hearing Bob's voice situation'.
The same thing is still true while a rainy thunderstorm is overhead, and both people must raise their voices to be heard. This fits with our observation of the robustness to noise that the address decoder neuron exhibits.
And further, I can "bring" the voice of my friend Bob "to mind", perhaps not or perhaps even concretely "internally listening" to a snippet of auditory sensor data. Even then, the nature, or essence, or gestalt, or seed vector, or inner marble, or gravitational center of meaning [Dennett], or symbol of Bob's voice is quite distinct and seems to be a perfectly valid concept for the mind to deal with.
This has led many thinkers to the conclusion that whatever represents a Bob voice situation can exist independently of Bob voice sensor data. This internalizing (the opposite of externalizing), this detachment from the outside world, this counterfactual planning, or imagining a possible future, or entertaining an idea, or thinking of a thought, or mental navigation2 of mental content is theorized to be supported via neuronal ensembles (Neuronal Ensembles, 5).
The Cell Assembly Hypothesis, which one might call conceptual neuroscience, goes back to Hebb [1949] (and before), roughly saying that ensembles make mental content. This is only useful when we can answer questions like: how many ensembles are there? what time window does an ensemble have? how are ensembles structured? and so forth12.
In Capgras delusion, the patient suffers delusional misidentification. Say Alice suffers an accident and will thenceforth think that her husband Bob is not Bob,
but Bob-imposter, believing something strange like "an alien imposter was installed to replace real-bob".
Perhaps Alice lost her marbles, or perhaps there was a hyperdimensional symbol bob, which contributed to the gestalt and identity of her perception of Bob.
But bob is no longer activated for some reason. Instead, there is another symbol bob-2.
In hyperdimensional computing, with no further assumptions than randomness, bob-2 is as different from bob as any other person symbol, indeed any other concept symbol.
It is sometimes said that the first impression of a person counts most. Perhaps because we "allocate" some kind of "prototype" "referring" to a person,
and the context somehow ends up contributing to its gestalt, basically forever. Perhaps bob-2 underwent the equivalent of a first impression in Alice's mind.
Attention Approximates Sparse Distributed Memory
- It's very technical. I can recommend Trenton Bricken's talk, it also gives a nice overview of SDM in the first place: 6.
- Bricken shows how the overlapping hyperspheres of SDM are approximated by the softmax in Attention.
- Both designs amplify the differences between addresses roughly exponentially.
- This is then corroborated by the fact that trained transformers converge on plausible values for SDM parameters. (In SDM, address activation probability is an important factor, in deciding between noise-signal-capacity tradeoffs.)
- Trained transformers have a beta value for their softmax that roughly optimizes query-noise. I.e. it tolerates noisy queries, trading off memory capacity.
- Attention keys - SDM address locations,
- Attention query - SDM address word x,
- Attention values - SDM content, ~S, I guess.
- This is quite remarkable; SDM was then pre-empting "Attention" by ~30 years. (A small sketch after this list compares the two readouts.)
- It's really strange to "learn" your memory via backpropagation.
- It is like trying out what the system would do if it had memory x.
- It's like filling a USB stick by shaping its content. I find it strange.
- It lies at hand that the alternative has to do with a dynamical process that actually uses a memory, by storing and retrieving.
- This suggests that "Attention" captures some essential properties of the computation of the Cerebellum, and that the power of Transformers comes from some computational trick(s) that the Cerebellum is doing, too.
- Note that so-called cerebellum-like structures are found across the animal kingdom, review. In flies they are Mushroom bodies, Are mushroom bodies cerebellum-like structures?. In electric fish, they enlarge evolutionarily alongside electrosensing. There is also one kind of fish with a gigacerebellum, basically the brain of this fish is made from a big cerebellum. I find this funny.
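The two readouts can be put side by side in a few lines. A toy sketch (ordinary Clojure vectors, made-up names, nothing from the repo): with a large beta the softmax weights concentrate on the best-matching keys, approximating the hard cutoff of the SDM hyperspheres.

;; Toy comparison of the two readouts over the same key/value pairs.
(defn dot [a b] (reduce + (map * a b)))

(defn softmax [xs]
  (let [es (map #(Math/exp %) xs)
        z  (reduce + es)]
    (map #(/ % z) es)))

(defn attention-read
  "Soft readout: softmax(beta * q.k) weighted sum of the values."
  [query ks vs beta]
  (let [ws (softmax (map #(* beta (dot query %)) ks))]
    (vec (apply map (fn [& col] (reduce + (map * ws col))) vs))))

(defn sdm-read
  "Hard readout: pool (sum) the values of every key within the activation
   cutoff. Classic SDM uses a Hamming radius on binary addresses; a dot
   product cutoff plays the same role in this toy."
  [query ks vs min-dot]
  (let [active (keep-indexed (fn [i k] (when (>= (dot query k) min-dot) (nth vs i))) ks)]
    (when (seq active)
      (vec (apply map + active)))))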
Neuronal Ensembles
Von Der Malsburg:
Attractor nets are the data structures of the mind. They form a construction kit for our mind states. They are the Lego blocks of the mind. The chunks of cognitive theory. And the syntax, the grammar under which they combine is: Any combination of coactive networks has to form a stabilized, self-organized, attractor net, that would be, given enough time, stable under self-organization. This is in principle a very simple system.
Vectors of Cognitive AI: Self-Organisation
These are all the same thing;
Attractor nets, subnetworks, population vectors, reverberating chains, neuronal ensembles [Lorente], Cell Assemblies [Hebb].
Lorente de Nó (1938) 34 called them chains or ensembles (I will use the term neuronal ensemble), for reviews: 35,36. These are conceptual recurrent connections in neuronal networks that could be activated on their own, thus shedding light on the mechanism and function of internal activity of the brain.
Donald Hebb [1949] called them Cell Assemblies. In his theory, they are self-sustaining, self-activating subnetworks, and the connections between them are allowed to represent causality: if A goes to B, then B follows A. The cell assemblies, together with the proposed local learning rule, Hebbian Learning, make up what is called Hebbian Theory. "Neurons that fire together, wire together" - the slogan is somewhat misleading. It is the neurons that cause each other to fire that wire; it takes 2 time steps to do that.
The Organisation of Behaviour 1949 is a classic; I recently found that Daniel Dennett mentioned in his memoir37 how Hebb influenced his thinking.
Figure 1: A Hebbian Cell Assembly (conceptual) is a group of neurons activating each other.
A Hebbian Cell Assembly is a group of neurons that activate each other. At time step 1 the cyan neuron activates the second neuron, and so forth, in a chain. They are in something of a "causal alliance", each causing the rest to fire. In that case, such activity would be self-igniting (an old term that I like) or self-stabilizing. If some of the neurons are active, the whole should be recovered; hence it is auto-associative.
A Hopfield net [Hopfield 1982] is a very simple formalism for recurrent neuronal nets inspired by magnetism, it can be used as an auto-associative memory, when used together with a Hebbian Learning Rule. Hopfield Networks are a special case of Sparse Distributed Memory. Thus, the history of HDC, neuronal nets, connectionism, Parallel Distributed Processing [McClelland, Rumelhart], and related ideas have much overlap, Kanerva reviews: 1.
elements and substrates
Note how we are thinking here on the level of the activity on top of neurons. These are elements that perhaps are better seen as some kind of icebergs that float on top of the substrate the neurons make. When they move, they flow through the neurons, made from activity. When they stay, they are circular, self-sustaining activity, like an activity that carved out a subnetwork, or made a pocket in the net; although they might be spread out in time and space, too. Like high-dimensional dominoes falling 'clack-clack-tick' around the network. "Neurons working together" doesn't quite capture it in my opinion. The key idea is that the activity creates a new emergent entity, the ensemble. Thinking about the ensemble, we disregard the neurons. Like when thinking about the forest, we disregard the trees.
Now you can regard the forest as a kind of entity; For instance, it sort of has tentacles or fingers of trees growing roughly outside of its edges and so forth (The details come from ecology).
The properties of water come from the randomness of the elements. But water is a very predictable substrate.38 Randomness at the neuronal level is very useful, it makes the ensembles more dynamic. Such a randomness or babble at the bottom comes up in many theories of self-organization and emergence.39
Randomness is an example of something that makes sense on the substrate level, not the element level. I think this 'the properties of the substrate' idea is a useful way to think of emergence.
Braitenberg 1978 40, and G. Palm 198241 elaborate on the temporal dynamics (temporal structures) of such assemblies. Thus, perhaps an assembly encoding for a piece of music, or a muscle movement would have dynamic parts, as well as static parts.
heterogeneous assembly
- A heterogeneous assembly has multiple sub-centers. E(A) ("excited A").
- The opposite is a homogeneous assembly, where all connections would have the same strength.
- Biological intuition tells us cell assemblies should be heterogeneous as a rule.
- If one reduces the excitability, one selects one of the sub-assembly centers of A.
- The other way around, if one takes a small assembly A1 and increases the excitability, one ignites E(A1),
- Or A1 + halo(A1).
- The halo is something like the city outskirts of A, somewhat connected, but loosely.
- That is a larger cell assembly with multiple centers.
Thought Pump:
- Activate the halo of the assembly by increasing the excitability.
- By decreasing the excitability, the assembly would narrow down into one of its centers.
- This depends on a threshold device, something that regulates the thresholds (excitability) of all the neurons in question. An "area"? A column? The whole cortex? "Sheet-wise"? The claustrum? Basal ganglia? Thalamus? Not figured out.
- It should select the best-connected sub-center (given ideal inhibition, see below).
- This move then, could be done at the speed of 2 neuron timesteps, and it could be a parallel search process.
- Jumping from thought to thought, creating a 'train of thought', hence 'thought pump'.
Figure 2: Concept of a Cell Assembly with halo. The halo only activates if the excitability of the neuronal substrate is high.
- If the assemblies are 'settled' and "don't move" anymore, perhaps this could be interpreted as the 'best reachable interpretation' for the system.
- This remains purely conceptual afaik, but it is a cool idea to have in any case. (A toy sketch of the threshold move follows below.)
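A toy sketch of the threshold move, in ordinary Clojure maps and sets; the weights and names are made up, and the threshold is simply the inverse of "excitability":

;; Toy 'thought pump': an assembly as a weighted graph. The active set at a
;; given threshold is everyone whose input from the currently active neurons
;; clears the threshold. Lowering the threshold (raising excitability) lets
;; the halo in; raising it narrows activity down to the best-connected core.
(def weights
  ;; neuron -> {neighbor weight}
  {:a {:b 3 :c 3 :h 1}
   :b {:a 3 :c 3 :h 1}
   :c {:a 3 :b 3}
   :h {:a 1 :b 1}})   ;; :h is halo-ish: weakly connected

(defn input [active neuron]
  (reduce + (map #(get-in weights [% neuron] 0) active)))

(defn step
  "One settling step at a given threshold."
  [active threshold]
  (set (filter #(>= (input active %) threshold) (keys weights))))

(step #{:a :b :c} 2) ;; => #{:a :b :c :h}   low threshold: core + halo ignites
(step #{:a :b :c} 5) ;; => #{:a :b :c}      high threshold: narrows to the core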
Since you are scrolling through already, check out this work on visualizing much of the activity in a zebrafish larva:
Zebrafish larvae are translucent early in their life.
Scientists at the Howard Hughes Medical Institute studied live zebrafish larvae that had been genetically encoded with a calcium indicator called GCaMP5G. They suspended the larva in a gel and then beamed it with lasers. Just before a neuron fires, its action potential is expressed via a spike in calcium ions, so when one of the genetically modified larva's neurons reached its action potential, it glowed. This showed the researchers the firing of the neurons without them having to attach a bunch of electrodes to the fish. Over the course of an hour the researchers used laser beams to scan the larva every 1.3 seconds, exciting the retina of the zebrafish with each scan. This microscopy method allowed the researchers to record up to 80 percent of the fish's 100,000 neurons at single-cell resolution. This is the first time scientists have recorded such a high percentage of an organism's brain activity at such a high resolution.
I'm not saying it makes thought pumps, but something incredibly dynamic is happening either way.
—
In Braitenberg 1978 42 I feel like they become animals made from 'patterns', dynamically dancing, flowing, stretching, like amoebas, in the rhythms of music, carrying information, working together to activate remote associations and things like that.
It is a small step to consider them little software animals. When they are information that reproduces, they become memes. If their effect is to be active again, they have causal power or competence without comprehension [Dennett].
Moshe Abeles 1982 was one of the first to do neurophysiology and find something like ensembles (note they were entirely conceptual up to this point!). His theory is dubbed the Synfire chain, because he and his collaborators emphasize how hard it is for the activity to survive on its own; hence the activity of synchronous groups of neurons could carry forward in a net. Like a forest fire, but the forest is sort of wet. And then you have groups of activated neurons going in a chain, or back and forth.
Papadimitriou et al. describe a model using Hebbian Learning43. They proved it to be Turing complete. (This is considered another candidate for programming neuromorphic hardware.) These are very Hebbian Assemblies: with sheet-inhibition (implemented by a simple cap-k), a recurrent neuronal net and Hebbian plasticity, assemblies emerge, which are auto-associative - coherently activated subnetworks of neurons, i.e. attractor subnetworks. (A toy sketch of the cap-k + Hebbian step follows below.)
For learning, just like with a Hopfield net, you present a sensory stimulus, which you must project or embed into your neurons.
Then, by presenting a part, the whole can be recovered, hence auto-associative.
By presenting 2 stimuli, an overlap assembly will form. By presenting a sequence, hetero associative assembly(s) will form. I imagine this as sort of a tunnel that an assembly with temporal structure takes:
A -> B -> C
If A has many connections to B, then B will fire after A. This causality notion is a major thing coming from Hebb's theory.
Such a flow of activity is akin to water flowing through a riverbed. If it goes in a circle, this would also be called a ring attractor.
Essentially, such Hebbian auto-associative subnetworks were already discussed in G. Palm 1982 41. Nothing is new under the sun.
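What a single projection step might look like as code: a toy sketch of the cap-k + Hebbian idea, not Papadimitriou et al.'s actual model, with all names and numbers made up.

;; One step of an assembly-calculus-style projection (toy sketch).
(defn cap-k
  "Sheet inhibition: keep only the k most strongly driven neurons."
  [inputs k]
  (set (map first (take k (sort-by (comp - second) inputs)))))

(defn project-step
  "weights: map of [pre post] -> w. Returns [new-weights new-active]."
  [weights active k beta]
  (let [inputs      (reduce (fn [m [[pre post] w]]
                              (if (contains? active pre)
                                (update m post (fnil + 0) w)
                                m))
                            {} weights)
        new-active  (cap-k inputs k)
        ;; Hebbian update: strengthen pre->post edges where pre was active
        ;; and post won the cap-k competition.
        new-weights (reduce (fn [m [[pre post] w]]
                              (if (and (contains? active pre)
                                       (contains? new-active post))
                                (assoc m [pre post] (* w (+ 1 beta)))
                                m))
                            weights weights)]
    [new-weights new-active]))

;; Example: :x1 and :x2 drive :y1 :y2 :y3; cap to the k=2 winners.
(def w0 {[:x1 :y1] 1.0, [:x1 :y2] 0.5, [:x2 :y1] 1.0, [:x2 :y3] 0.4})
(second (project-step w0 #{:x1 :x2} 2 0.1)) ;; => #{:y1 :y2}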
—
Rafael Yuste about the electrophysiology of the neuronal ensembles. They are not Hebbian in Cortex
Summarizing:
- They can artificially produce ensembles via optogenetics; it is as if neurons that fire together are glued together (for days!).
- They can make a mouse behave as if it were seeing x, by activating the same ensemble as when it was seeing x. (suggesting ensembles indeed are internal units of perception).
- They can activate ensembles via single neuron activation (remarkable I think). But this only works 60% of the time.
- They find that cortical ensembles are not Hebbian44. Not "fire together, wire together", but: everybody that was active gets more excitability. (It's an even simpler, even more local rule, hence very satisfying).
- The important lesson: the brain's causality-encoding structure (the main point of Hebbian Theory) comes from something else.
—
György Buzsáki12 provides a cleaned-up definition: the reader-centric cell assembly. His neurophilosophy emphasizes truly satisfying, mechanistic, first-principles thinking. That Buzsáki initially wanted to be a radio engineer sheds light on his work. There are oscillators, rhythms, syntaxes, reader frames, analogies to music. Quite captivating.
Buzsáki's empirical work suggests that we are endowed with pre-allocated symbols, and that nothing is new to the brain, we match our internal symbols (which are more like snippets of music) to the trajectories of our actions and the world.
Buzsáki also suggests certain departures from current neuronal modeling that I am a fan of. The brain is primarily maintaining its internal dynamic. Learning doesn't much change those patterns. This does away with the (essentially behaviorist) notion of the brain as a receptacle of knowledge. Knowledge creation is an intrinsic, self-generated activity.
The primacy of action - The first nervous systems simply moved bodies around and sensors are a secondary invention.
Neurons that wire together fire together.
The meaning of this is that during development, sub-networks of neurons are allocated by growing in the same cohort (~ same day). These subnetworks are stable cell assemblies of the brain - mostly pre-allocated. Like the keys of a piano, it's their composability that matters. This also fits the notion that the brain is a robust system.
The brain is a spaghetti.
Action is the only second opinion. - A cognitive system can ground its internal symbols via action. (How to do this in machine intelligence is a further question, part of the motivation to pursue such things as Sequence Memory - Higher Order Via K-fold, Kanerva also has a shot at a design in 3).
This inside-out philosophy is an answer to David Deutsch's critique of AI (which he says is virtually the opposite of AGI): the idea must come first. The ideas must be essentially random, generic or 'for free' at the bottom; tying it together with Kanerva's path of least assumption, asking what you get from assuming nothing but randomness.
More on this in the future, as I want my AI to be essentially inside-out.
Hyperlisp V0
This was an early experiment, on the shelf now.
This is a sequence processor where all expressions are hypervectors; where mix, bind etc. are primitive operations; where symbols are multisymbols, and multiple things are evaluated when a multisymbol is encountered.
(cleanup* (h-eval (h-read-code (mix :heads :tails))))
;; (:heads :tails)
(mix :heads :tails) is the superposition of its inputs.
Symbols can be multisymbols. If the evaluator encounters a multisymbol in operator position, it will evaluate all possible branches, and return the superposition of the outcomes:
(cleanup* (h-eval (h-read-code ((mix + - *) 10 10)))) ;(100 0 20)
Similarly, if evaluates multiple branches:
(cleanup* (h-eval (clj->vsa [['lambda [] ['if [possibly true false] :heads :tails]]]))) ; (:heads :tails)
A slightly more substantial piece of hyper lisp code:
(cleanup* (h-eval (clj->vsa ['let ['outcome [['lambda [] ['if [possibly true false] :heads :tails]]]]
                             [impossibly 'outcome :tails]])))
;; (:heads)

;; the reader cleans up the need for quoting all the symbols:

(cleanup* (h-eval (h-read-code (let [outcome ((lambda () (if (possibly true false) :heads :tails)))]
                                 (impossibly outcome :tails)))))
;; (:heads)
Frames / Analogies / Templates / Possibility / Impossibility and so forth? ("What Is Programing With Concepts?")
- Eventually, perhaps we want to program with concepts, or analogies. Vocabulary ideas (a toy sketch follows at the end of this list):
- tolerate - express a range of possible filler values, or express a range of tolerated properties.
- For instance, ballness (the ball frame) tolerates any kind of material, or color and so forth.
- require - express that a property must hold for the fillers, for instance, the shape of a ball is perhaps required to be approximately round.
- Everything other than having a round shape moves away from the notion of ballness in concept space.
- In terms of Mitchell and Hofstadter 1988, such a moving away constitutes pressure on a concept, which might slip into another concept. This slippage and fluidity is a hallmark of creative cognition.
- A cognitive system might acquire such properties via matching actions and predictions:
- For instance, you might expect to trace the outline of a ball by moving the eyes in a circle, the ball to be graspable, rollable, throwable, be able to be covered by fingers, have a certain weight when held, have a certain distance when reaching out for it, have the same color from all sides, have a texture when stroked, throw a shadow when illuminated by a light source, sink or float in water (depending on its material), be called a 'ball' by other persons, be permissible to fulfill the role of 'ball' in the rules of games that require a 'ball' (or not, then interestingly so), be able to be put into the mouth, be bouncy or fragile depending on the material, be able to be brought to mind from memory, be able to be talked about, be able to have its properties listed, be different from a banana and from almost all other things for that matter, be able to be in the possession of a person
- This list tries to stay within "one degree of complexity", we can easily imagine a ball to be food by ball-eating goblins, to explode in a world where all balls are stuffed with gunpowder, to be worth a million dollars because it is a rare special ball relic of an ancient alien intergalactic soccer game, to be a dragon egg in a world where dragons hide their eggs disguised as ordinary balls
- This ability to put together ideas and concepts must be a deep aspect of the computational building materials out of which ideas can be made.
- If any of these properties are stretched too far, for instance, the ball is made from jelly and doesn't roll anymore, then it's not a ball anymore.
- Perhaps a useful hyperlisp would be programming with counterfactuals [inspired by Chiara Marletto and David Deutsch's 'Constructor theory'], with what-ifs, possibilities and impossibilities.
- Aaron Sloman [in personal communication]: impossibility, necessary connection, possibility spaces.
- (although he would argue there might be chemical computing at the synapses computing this for all we know).
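A toy sketch pinning the tolerate/require vocabulary down with ordinary Clojure data (sets and maps, not hypervectors); the frame contents are made up:

;; A 'ball' frame with required and tolerated role fillers.
(def ball-frame
  {:require  {:shape #{:round}}                        ;; must hold
   :tolerate {:material :any                           ;; any filler is tolerated
              :color    :any
              :size     #{:tiny :small :medium :large}}})

(defn fits-frame? [frame thing]
  (and (every? (fn [[role allowed]] (contains? allowed (get thing role)))
               (:require frame))
       (every? (fn [[role allowed]]
                 (or (= allowed :any) (contains? allowed (get thing role))))
               (:tolerate frame))))

(fits-frame? ball-frame {:shape :round :material :rubber :color :red :size :small}) ;; => true
(fits-frame? ball-frame {:shape :cube  :material :rubber :color :red :size :small}) ;; => false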
Neuronal Syntax
- G. Buzsáki[2] points out that encoding in time doesn't need to mean that something in time is being represented.
- A cognitive composite data structure of the brain might be encoded in something that is spread across time.
- In fact, Buzsáki thinks that neuronal syntax works by juxtaposing neuronal letters.
- The putative neuronal letter is an assembly firing in gamma frequency (Gamma waves), juxtaposed in time into neuronal words, that are 7 +/- 2 gamma timestep long (theta frequency).
- In software speech, the elementary or atomic datatype is an ensemble in gamma frequency. A first-order composite is a "theta word" of ~7 elements, (then higher-order composites are made from multiples again and so forth, forming a kind of rhythm hierarchy).
- The Magical Number Seven, Plus or Minus Two is a 'meta' empirical finding of cognitive psychology [George Miller 1956].
- This would mean that the neuronal code45 is encoded in time.
- In other words, a neuronal interpreter would read a data structure spread across time.
- To be precise about what I have in mind: imagine you try to build a hyperdimensional ALU using neurons as input and output registers, implementing something like primitive superposition or bind operations; what are the input registers? If I need 2 inputs for a bind (I need 2 input registers), then what is the relationship between the inputs? I'm sure there is a range of possible answers…
- Suppose the inputs are neuronal activity juxtaposed across two subnetworks (maybe the first idea?), shouldn't these be kept in tandem or something? Intuitively, everything that subnetwork A is saying, I think B should be able to say, too. This seems a very strange requirement for brain neuronal net architecture.
- This problem is solved when the input registers are the same neurons, but juxtaposed in time. (A toy sketch at the end of this list illustrates the idea.)
- The relationship is now 1:1. For instance, the overlap of A and B is now simple; it just means overlap.
- This would mean the brain computation is 'rhythmical' or 'musical'. [short talk;46, book-length: 47, review: 48], [Abeles e.g. Spatiotemporal Structure of Cortical Activity: Properties and Behavioral Relevance], [Earl K. Miller].
- If true, then cognition uses a musical computation; I consider this a joyful twist that suddenly puts the human taste for music right at the fabric of cognition.
- This might shed light on things like Flash lag illusion.
- If I might speculate, I would think the situation interpretation, ~ "essence finding" in terms of Hofstadter + Sander [2013]23, has something of a time window (perhaps 1 theta rhythm or 1/2th?), for finding the matching hippocampal trajectory (in terms of Buzsáki[2]), ~ that is roughly one sequential data structure coming out of an SDM here with 7 +/- 2 elements. (Assuming that the Hippocampus is roughly an SDM) - a neuronal word.
- I.e. there would be a 125 - 250 ms time window for finding an interpretation of what "is in short-term memory".
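A toy sketch of the "same neurons, juxtaposed in time" idea with the usual HDC toolkit: sparse binary vectors, rotation as the per-time-slot permutation, and OR as superposition. Nothing here is a claim about the actual neuronal code; it only shows how one set of units can carry a composite when its letters are spread across time slots.

;; ~7 'gamma letters' juxtaposed into one 'theta word'.
(defn rand-sparse-hv [n ones]
  (let [idxs (set (take ones (shuffle (range n))))]
    (mapv #(if (contains? idxs %) 1 0) (range n))))

(defn rotate [v k]                    ;; stand-in for a per-time-slot permutation
  (let [n (count v) k (mod k n)]
    (vec (concat (subvec v k) (subvec v 0 k)))))

(defn superpose [& vs] (apply mapv max vs))

(defn overlap [a b] (reduce + (map * a b)))

(defn theta-word
  "Permute each letter by its time slot, then superpose everything."
  [letters]
  (apply superpose (map-indexed (fn [slot letter] (rotate letter slot)) letters)))

(def letters (vec (repeatedly 7 #(rand-sparse-hv 1000 20))))
(def word (theta-word letters))

(overlap (rotate (nth letters 3) 3) word) ;; => 20, the letter in slot 3 pops out
(overlap (nth letters 3) word)            ;; => small, ~ chance overlap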
Making use of time:
- Using time for binding is an old idea [C von der Malsburg 1981, review, Abeles 1982]. (Bienstock 1996 ties it together with compositionality in a cool way.)
- An example of making a time-space tradeoff is this neuromorphic, spiking hardware implementation of the block code bind 49.
- This works via delay lines, similar to "polychronization feature binding" 50.
- Note that it doesn't mean that we (software engineers) need to encode something across time. We merely need to extract useful essential concepts.
Tom Riddle: [Adjusting ring on his finger, the same one in present day Dumbledore's office] Can you only split the soul once? For instance, isn't seven…
Horace Slughorn: Seven? Merlin's beard, Tom! Isn't it bad enough to consider killing one person? To rip the soul into seven pieces… This is all hypothetical, isn't it, Tom? All academic?
Tom Riddle: [Smiling] Of course, sir. It'll be our little secret.
(from Harry Potter Half-Blood Prince).
Literature
1
P. Kanerva, “Hyperdimensional computing: An introduction to computing in distributed representation with high-dimensional random vectors,” Cognitive Computation, vol. 1, no. 2, pp. 139–159, 2009.
Here and in 3, Kanerva opens up a new space between computer science and neuroscience. Highly technical, but still accessible. This is truly fascinating work.
2
Buzsáki, G. (2019). The Brain From Inside Out. New York: Oxford University Press.
3
Sparse Distributed Memory, Pentti Kanerva 1988
One might check 1 first. And if one has a taste for more, SDM will be very interesting.
4
Laiho et.al. 2015 High-Dimensional Computing with Sparse Vectors https://www.researchgate.net/publication/299535938_High-Dimensional_Computing_with_Sparse_Vectors
5
Rafael Yuste Lectures In Neuroscience, 2023
6
Kanerva talks:
- Stanford Seminar - Computing with High-Dimensional Vectors
- Pentti Kanerva
- Pentti Kanerva - Modeling the brains computing with hyperdimensional computing
- Pentti Kanerva - The computer and the brain
- Pentti Kanerva - Cerebellum as a neural RAM (UC Berkeley, Fall 2010)
- Pentti Kanerva - Computing in Superposition in the Classical Domain
- Playlist: Some talks
- Related: The Kanerva Machine (went over my head at this point in time).
7
Pentti Kanerva 1993 Sparse Distributed Memory and Related Models
https://redwood.berkeley.edu/wp-content/uploads/2020/08/KanervaP_SDMrelated_models1993.pdf
8
Vector Symbolic Architectures as a Computing Framework for Emerging Hardware Denis Kleyko, Mike Davies, E. Paxon Frady, Pentti Kanerva, Spencer J. Kent, Bruno A. Olshausen, Evgeny Osipov, Jan M. Rabaey, Dmitri A. Rachkovskij, Abbas Rahimi, Friedrich T. Sommer
Footnotes:
Crawford E, Gingerich M, Eliasmith C. Biologically Plausible, Human-Scale Knowledge Representation. Cogn Sci. 2016 May;40(4):782-821. doi: 10.1111/cogs.12261. Epub 2015 Jul 14. PMID: 26173464.
That is, they are 10,000-bit-wide vectors, if binary.
Tony A. Plate (1995), "Holographic Reduced Representations", IEEE Transactions on Neural Networks, vol. 6, no. 3, pp. 623-641, 1995, 21 pages
You can get the postscript version here: http://d-reps.org/index.html#software
The Computer and the Brain, John Von Neumann, posthumously 1958
Gary Marcus analyses the shortcomings of current AI approaches:
Louis A. Jaeckel, 1989 An alternative design for a sparse distributed memory
In Neuronal Ensemble Memetics, I argue that neuronal ensembles (pattern completed, distributed representations) can be seen as a theory of mentality. The requirement for that is to say how computational systems can represent perception states, motor plans, cognition states or perhaps mental content to itself.
Yuste conjectures ensembles to be the internal units of perception. Cool talk: Rafael Yuste: Can You See a Thought? Neuronal Ensembles as Emergent Units of Cortical Function.
Many people label this consciousness. I would like to call this brain software, or cognition software maybe.
The second sense of consciousness, and qualia, is the phenomenological sense of consciousness. My feeling is that a mere software theory will not provide a satisfying answer.
My feeling is that this requires a theory of simulated virtual realities, and therefore a theory of reality.
The new generation will have no trouble accepting that consciousness is a virtual machine. They are used to virtual machines when using the software of their computers and phones.
(From memory, not verbatim Dennett 2023.)
✓ Consciousness is a virtual simulated reality created by software - this is acceptable.
New questions:
- What is software?
- What is virtual?
- What is reality?
Chomsky recently said on YouTube: "What we don't understand is matter".
This is also reminiscent of the 4 strands of David Deutsch's The Fabric of Reality.
https://youtu.be/JUK9PUTKPWU?si=cEA31y5iMHwU1GlY
I recommend the keynote:
https://arxiv.org/abs/2004.11204 Classification using Hyperdimensional Computing: A Review
Buzsáki G. Neural syntax: cell assemblies, synapsembles, and readers. Neuron. 2010 Nov 4;68(3):362-85. doi: 10.1016/j.neuron.2010.09.023. PMID: 21040841; PMCID: PMC3005627.
τ is the name for the membrane time constant. In cortical principal cells this is 10-30 ms. For a trail of literature: Neural syntax: cell assemblies, synapsembles and readers, 2
Buzsáki G. 2010 references
Koch C, Rapp M, Segev I. A Brief History of Time (Constants) Cereb. Cortex. 1996;6:93–101. https://pubmed.ncbi.nlm.nih.gov/8670642/
Alluding to things like bio-electric fields, of which I don't know enough to say much at this point in time
https://pubmed.ncbi.nlm.nih.gov/3749885/
Georgopoulos AP, Schwartz AB, Kettner RE. Neuronal population coding of movement direction. Science. 1986 Sep 26;233(4771):1416-9. doi: 10.1126/science.3749885. PMID: 3749885.
Binding and Normalization of Binary Sparse Distributed Representations by Context-Dependent Thinning Rachkovskij, Kussul 2000
BSC Fully Distributed Representation Kanerva 1997
Buzsáki, G. (2019). The Brain From Inside Out. New York: Oxford University Press.
That HDC can get results by requiring little else than randomness is technically sweet and biologically unproblematic.
Zhonghao Yang 2023 Cognitive modeling and learning with sparse binary hypervectors
https://paperswithcode.com/paper/cognitive-modeling-and-learning-with-sparse#code
Rachkovskij 2001
Surfaces and Essences: Analogy as the Fuel and Fire of Thinking, 2013 Douglas R. Hofstadter, Emmanuel Sander
Ashok Litwin-Kumar: Generalizing theories of cerebellum-like learning
Cool computational neuroscience talk
Jaeckel, L.A. 1989a. An Alternative Design for a Sparse Distributed Memory. Report RIACS TR 89.28, Research Institute for Advanced Computer Science, NASA Ames Research Center.
Cell Assemblies in the Cerebral Cortex, V. Braitenberg 1977
Braitenberg V. Is the cerebellar cortex a biological clock in the millisecond range? Prog Brain Res. 1967;25:334-46. doi: 10.1016/S0079-6123(08)60971-1. PMID: 6081778.
But without first-class continuations, haha
Minsky 1974, A Framework for Representing Knowledge/
https://courses.media.mit.edu/2004spring/mas966/Minsky%201974%20Framework%20for%20knowledge.pdf
(all his talks are great).
Rafael Yuste Lectures In Neuroscience, 2023
Daniel Dennett I've Been Thinking, 2023
RIP one of the most relevant influences on my world view.
I independently derived the same concept of dynamism or fluidity. Called memetic engine here.
- In a 'subcognitive' model: randomness in service of intelligence [Mitchell and Hofstadter 1988].
- The same thing for neuronal ensembles empirically: Drifting assembles. Intrinsic firing rate and random skip rate are a feature, not a bug. It's only possible to regard it as a feature, if we consider the substrate, not the elements.
- I recently came across this podcast where M. Levin, K. Friston and C. Fields label the concept 'babble'. Apparently, quantum babble is the same thing in a physics theory.
Cell Assemblies in the Cerebral Cortex, V. Braitenberg 1978
https://link.springer.com/chapter/10.1007/978-3-642-93083-6_9
G. Palm Neural Assemblies: An Alternative Approach to Artificial Intelligence, (first edition: 1982, 2nd ed.: 2022)
Braitenberg On the Texture of Brains -An Introduction to Neuroanatomy for the Cybernetically Minded, 1977
See 'assembly calculus':
Assemblies of Neurons Learn to Classify Well-Separated Distributions, Santosh S. Vempala's homepage.
Tzitzitlini Alejandre-García, Samuel Kim, Jesús Pérez-Ortega, Rafael Yuste (2022) Intrinsic excitability mechanisms of neuronal ensemble formation eLife 11:e77470
Neuronal code is an analogy to genetic code; the genetic code is defined by its interpreter, the ribosome. The syntax of the genetic code is base triplets, or codons. Each triplet corresponds to an amino acid or is one of the stop codons or start codons. See Genetic Code.
Analogously, the project of finding a neuronal code or codes would mean understanding what the interpreter(s) and the syntax are.
G. Buzsáki Rhythms of the Brain, 2011
Buzsáki G, Vöröslakos M. Brain rhythms have come of age. Neuron. 2023 Apr 5;111(7):922-926. doi: 10.1016/j.neuron.2023.03.018. PMID: 37023714; PMCID: PMC10793242.
https://dl.acm.org/doi/fullHtml/10.1145/3546790.3546820
Sparse Vector Binding on Spiking Neuromorphic Hardware Using Synaptic Delays
https://pubmed.ncbi.nlm.nih.gov/29951198/
In a 'basic polychronous group' [S. Singer and collaborators], A and B share a reader C. Via delay lines, C represents the notion that A activates B.