Scroll down to V2 below. V1 is very poorly written.
Here’s a refined algorithm for tokenizing a Lakh-flavoured pop music MIDI file for music theory processing.
Suppose we generate measure 20 and it’s identical to measure 16. We encode this with a token “repeat_m_4”, where 4 is the relative distance back in measures. Always reference the closest matching measure (exception: if the same “repeat_m_i” can be emitted many times in a row, simply keep emitting it).
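Here is a minimal sketch of that measure-level rule, assuming each measure has already been reduced to some hashable content (here a tuple of tokens). The function name is made up for illustration, and the "keep emitting the same repeat_m_i" exception is deliberately left out.

```python
def encode_measure_repeats(measures):
    """Replace each measure that equals an earlier one with repeat_m_<distance>.

    `measures` is a list of hashable measure contents (an assumption of this
    sketch). The closest earlier identical measure is always referenced.
    """
    last_seen = {}  # measure content -> index of its most recent occurrence
    out = []
    for i, content in enumerate(measures):
        if content in last_seen:
            out.append(f"repeat_m_{i - last_seen[content]}")
        else:
            out.append(content)
        last_seen[content] = i
    return out


song = [("a",), ("b",), ("a",), ("a",), ("b",)]
print(encode_measure_repeats(song))
```

Because `last_seen` always stores the latest occurrence, the reference automatically points at the closest match.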
One expected consequence of relative pitch encoding is that a prior tonic estimation isn’t necessary, at all, whereas the NN trained on this tokenization will probably be useful in inferring the tonic afterwards.
The next question is how to leverage basic-pitch recognition on Spotify data to further improve the model. I suspect that some genres are easier than others, and we might start with the easier ones.
Not all MIDI files are equal. First, focus on narrowing a dataset by excluding classical music, solo piano works and jazz.
After some experiments, I’m trying to write up a better tokenizer.
A tokenizer works on a grid of measures and channels (aka tracks aka voices). A cell is one measure+channel pair, holding the onsets that fall in it, e.g. cell m.20 ch.3.
A tokenizer makes two passes through all cells via two nested loops:
    for measure in measures:
        for channel in channels:
            cells[measure][channel] = encode(cells[measure][channel], ...context)
In the first pass, it transforms raw MIDI onsets into the intermediate representation (IR). In the second pass, it works on IR and tries to replace as much of it as possible with reference tokens marking repetition/doubling/transposition/reuse of ideas of any sort.
The main idea: strumming patterns, drum patterns and melodic patterns should be referenced inside a song using special reference tokens. To make that possible, every instance of a pattern should be decoupled into a bag of notes and a pattern skeleton. A bag of notes is the set of MIDI note numbers the pattern uses. Inside a skeleton, they are referenced via local numbering as s_0, s_1, s_2… from lowest to highest.
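A rough sketch of that decoupling, assuming a cell is a list of (onset_in_beats, midi_pitch) pairs. That input format and the function name are my assumptions; only the s_i numbering comes from the text above.

```python
def decouple(notes):
    """Split a cell into a bag of notes and a pattern skeleton.

    `notes` is a list of (onset_in_beats, midi_pitch) pairs (assumed format).
    Returns (bag, skeleton): the bag is the sorted list of distinct pitches,
    the skeleton keeps the onsets but replaces each pitch with s_<index>.
    """
    bag = sorted({pitch for _, pitch in notes})
    index = {pitch: i for i, pitch in enumerate(bag)}
    skeleton = [(onset, f"s_{index[pitch]}") for onset, pitch in sorted(notes)]
    return bag, skeleton


# The same arpeggio shape over C major and over A minor.
c_major = [(0.0, 60), (0.5, 64), (1.0, 67), (1.5, 64)]
a_minor = [(0.0, 57), (0.5, 60), (1.0, 64), (1.5, 60)]
print(decouple(c_major))
print(decouple(a_minor))
```

The two cells get different bags but an identical skeleton, which is exactly what a reference token can later point to.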
Ideally, this decoupling should happen on units of harmonic rhythm (for chord tracks) and most common melodic lengths (for melodic tracks). For a slower harmonic/phrase rhythm, it shouldn’t be an issue due to the second pass.
In Western pop songs, there is usually a bass track that serves as a good reference for the other tracks. This track is also usually monophonic. So I propose to encode bass notes from left to right throughout the entire track using relative references. This way typical patterns like 2-5-1-6, chromatic descents, diatonic descents etc. all become naturally visible, even though we don’t do any tonic inference in preprocessing.
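As a quick sketch, here is one way the bass line could be encoded, assuming "relative" means the interval in semitones from the previous bass note. The token names `bass_start` and `bass_±N` are placeholders of mine, not part of the scheme above.

```python
def encode_bass(pitches):
    """Encode a monophonic bass line as intervals from the previous note.

    `pitches` is a list of MIDI note numbers in time order. Token names are
    hypothetical; the first note carries no pitch information at all.
    """
    if not pitches:
        return []
    tokens = ["bass_start"]
    for prev, cur in zip(pitches, pitches[1:]):
        tokens.append(f"bass_{cur - prev:+d}")
    return tokens


# 2-5-1-6 in C major (D, G, C, A) and the same progression in D major.
print(encode_bass([38, 43, 36, 33]))
print(encode_bass([40, 45, 38, 35]))
```

Both calls produce the same token sequence, so the progression is recognizable without knowing the key.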
Then a key assumption is that all other tracks get the most useful information for pattern extraction by referring to the first bass note in the same measure. (An exception is an anticipated bass note that starts in the previous measure but is still sounding at the start of this one.)
Let’s encode a bag of notes for any cell above the bass. First we find a pivot MIDI number: the lowest octave transposition of the first bass note (+0, +12, +24, …) that is either in the bag of notes or, failing that, as close to the bag as possible. The pivot is encoded as oct_0/oct_1/oct_2 etc., and the bag of notes is encoded relative to it, e.g. rel_-5, rel_0, rel_4 for a major chord in second inversion. (An alternative would be to encode notes relative to one another, exposing intervals rather than chord degrees.)
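One possible reading of the pivot rule, as a sketch. I'm assuming that "as close as possible" means the smallest distance from the pivot to any note in the bag, with ties going to the lower octave, and that the bag is non-empty. The function name is made up.

```python
def encode_bag(bass_pitch, bag):
    """Encode a non-empty bag of MIDI pitches relative to an octave of the bass.

    Candidate pivots are bass_pitch + 12*k for k = 0, 1, 2, ... within MIDI
    range. The lowest candidate contained in the bag wins; otherwise the
    candidate nearest to any bag note wins (lower octave on ties).
    """
    candidates = [(k, bass_pitch + 12 * k) for k in range(11)
                  if bass_pitch + 12 * k <= 127]
    hits = [(k, p) for k, p in candidates if p in bag]
    if hits:
        k, pivot = hits[0]
    else:
        k, pivot = min(
            candidates,
            key=lambda kp: (min(abs(n - kp[1]) for n in bag), kp[0]),
        )
    return [f"oct_{k}"] + [f"rel_{n - pivot}" for n in sorted(bag)]


# Bass on C2 (36), chord G3-C4-E4: a C major triad in second inversion.
print(encode_bag(36, [55, 60, 64]))
```

With the bass on C2 the pivot lands on C4, two octaves up, which reproduces the oct_2 rel_-5 rel_0 rel_4 example.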
In most MIDI files we have a good grid of measures and beats that is constructed from time signature and BPM events. Therefore, every onset can be encoded as beat.subdivision.
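A tiny sketch of that conversion, assuming we already know the tick at which the measure starts and the file's ticks-per-beat resolution, and that a beat is a quarter note. Both parameters and the function name are assumptions of this example; the t_ prefix follows the tokens used further down.

```python
def onset_token(tick, measure_start_tick, ticks_per_beat):
    """Turn an absolute MIDI tick into a beat.subdivision token for its measure.

    Assumes one beat equals `ticks_per_beat` ticks (i.e. a quarter-note beat)
    and that `measure_start_tick` was derived from the time signature events.
    """
    beats = (tick - measure_start_tick) / ticks_per_beat
    return f"t_{beats:.2f}"


# Second measure of a 4/4 file at 480 ticks per beat.
measure_start = 4 * 480
print(onset_token(measure_start + 240, measure_start, 480))  # straight 8th
print(onset_token(measure_start + 320, measure_start, 480))  # swung 8th
```

Rounding to two decimals is a crude stand-in for proper quantization, which real files would probably need.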
For better pattern extraction it’s probably better to work with time shifts between two onsets rather than with absolute times within a measure.
A major scale in 8ths will look like this: s_0 t_0.00 n_+1 ts_0.50 n_+1 ts_0.50 n_+1 ts_0.50 n_+1 ts_0.50 ..., assuming its bag of notes is something like oct_2 rel_0 rel_2 rel_4 rel_5 rel_7 .... An s_0/s_1/etc. tells which note from the bag of notes starts the pattern. After that, notes in a pattern are referenced relative to the previous note's position within the bag of notes, as n_±1, n_±2 etc. A t_0.00 is an absolute coordinate for the first onset inside a measure, and each ts_ is a time shift from the previous onset. (Is there a better way to unify t_ and ts_?)
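To make the token order concrete, here is a sketch that produces the scale example from raw notes. It assumes a monophonic cell given as (onset_in_beats, midi_pitch) pairs sorted by onset, and it writes time shifts with the ts_ prefix. The function name is made up.

```python
def encode_melodic_pattern(notes, bag):
    """Encode a monophonic pattern as s_/t_ for the first note, n_/ts_ after.

    `notes`: list of (onset_in_beats, midi_pitch), sorted by onset (assumed).
    `bag`: sorted list of the distinct pitches used by the pattern.
    """
    index = {pitch: i for i, pitch in enumerate(bag)}
    first_onset, first_pitch = notes[0]
    tokens = [f"s_{index[first_pitch]}", f"t_{first_onset:.2f}"]
    prev_onset, prev_idx = first_onset, index[first_pitch]
    for onset, pitch in notes[1:]:
        idx = index[pitch]
        tokens.append(f"n_{idx - prev_idx:+d}")          # step within the bag
        tokens.append(f"ts_{onset - prev_onset:.2f}")    # shift from last onset
        prev_onset, prev_idx = onset, idx
    return tokens


scale = [(0.0, 60), (0.5, 62), (1.0, 64), (1.5, 65), (2.0, 67)]
print(" ".join(encode_melodic_pattern(scale, [60, 62, 64, 65, 67])))
```

A descending scale over the same bag would differ only in the starting s_ index and the sign of the n_ steps.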
Two shorthands are used:
A swung-eighths strumming pattern will look like this: n_0 n_1 n_2 t_0.00 ts_0.67 ts_0.33 ts_0.67 ts_0.33 ....
Drums are unlike notes. Kick, snare and hi-hats have somewhat independent patterns that can be stacked on top of each other. On the other hand, all three hi-hat events are maybe connected.
For now we can encode them as drum_35 t_0.00 ts_2.00 drum_42 t_1.00 ts_3.00 ....
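A sketch of that per-drum encoding, assuming drum hits arrive as (onset_in_beats, drum_note_number) pairs for one measure. The function name and input format are assumptions; each drum note gets its own t_/ts_ run, so the independent layers stay separate.

```python
from collections import defaultdict


def encode_drums(hits):
    """Encode one measure of drum hits, one token run per drum note number.

    `hits`: list of (onset_in_beats, drum_note_number) pairs (assumed format).
    For every drum: drum_<n>, t_<first onset>, then ts_<shift> per later hit.
    """
    by_drum = defaultdict(list)
    for onset, drum in hits:
        by_drum[drum].append(onset)
    tokens = []
    for drum in sorted(by_drum):
        onsets = sorted(by_drum[drum])
        tokens += [f"drum_{drum}", f"t_{onsets[0]:.2f}"]
        tokens += [f"ts_{b - a:.2f}" for a, b in zip(onsets, onsets[1:])]
    return tokens


# Kick (35) on beats 0 and 2, closed hi-hat (42) on beats 1 and 3.
print(" ".join(encode_drums([(0.0, 35), (1.0, 42), (2.0, 35), (3.0, 42)])))
```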
Again, the outer loop goes over measures, the inner loop goes over channels (in order from lowest to highest average channel pitch), assuming all previous cells are already encoded.
If this particular cell is already a part of a large repetition found and encoded previously, we skip it. Otherwise, we try to find the best strategy to encode its most probable semantic relation to the content seen before (in previous measures + in cells below this one in the same measure). There’s a semantic hierarchy on repetitions:
(Whenever we encode that a target sequence of cells partly repeats a source sequence of cells, we assume that they don’t overlap.)
repeat_D_L.
double_C_L.
riff_N.
transpose_D_L_N. (A repeat_D_L is a special case of transposition with N = 0. My gut feeling is that a diagnosed doubling, especially of longer length, is more important than a transposition.)
Here comes the coolest part. The previous steps can’t help us encode a transposed melody or a repeated strumming on a different chord quality within a key. This is because our tokenizer has no notion of scales. To capture such reuse, we try to encode a pattern and a bag of notes separately.
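Before going into patterns and bags, here is a sketch of how the repeat_D_L / transpose_D_L_N lookup from the hierarchy could work within a single channel. I'm assuming D is the distance back in measures, L the length in measures and N the shift in semitones, and that each cell is a tuple of (onset, pitch) pairs. Doubling and riffs are left out, and the function name is made up.

```python
def find_reference(cells, i):
    """Find a repeat or transposition token for measure i of one channel.

    `cells`: list of per-measure tuples of (onset_in_beats, midi_pitch).
    Exact repeats are tried before transpositions; within each, the closest
    source wins and is extended while source and target do not overlap.
    """
    def shift_between(src, dst):
        # Return N if dst equals src transposed by N semitones, else None.
        if not src or len(src) != len(dst):
            return None
        n = dst[0][1] - src[0][1]
        same = all(o1 == o2 and p2 - p1 == n
                   for (o1, p1), (o2, p2) in zip(src, dst))
        return n if same else None

    for exact_only in (True, False):
        for d in range(1, i + 1):
            n = shift_between(cells[i - d], cells[i])
            if n is None or (exact_only and n != 0):
                continue
            length = 1
            # length <= d keeps the source span from overlapping the target.
            while (length < d and i + length < len(cells)
                   and shift_between(cells[i - d + length],
                                     cells[i + length]) == n):
                length += 1
            if n == 0:
                return f"repeat_{d}_{length}"
            return f"transpose_{d}_{length}_{n}"
    return None


m0 = ((0.0, 60), (1.0, 64))
m1 = ((0.0, 62), (2.0, 65))
up2 = lambda cell: tuple((o, p + 2) for o, p in cell)
print(find_reference([m0, m1, m0, m1], 2))
print(find_reference([m0, m1, up2(m0), up2(m1)], 2))
```

Cells covered by the returned length L would then be skipped by the outer loop, as described above.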
Intuitively, encoding a pattern is more important because a pattern is more likely to be unique to this piece, whereas a harmony (a bag of notes) is more likely to generalize across a large corpus. (Although this is probably not relevant.)
pattern_D_L. Probably not very important for very short patterns (of 1 or 2 notes) that occurred 50 measures ago.
harmony_D_L.
What else can now be reused? Some partial repetitions of notes between two measures. This is harder to tokenize with room for generalization, though: we’d need to remove and add arbitrary notes, and the transformer would likely get confused on a dataset as small as ours.
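Going back to pattern_D_L and harmony_D_L, a very reduced sketch of the lookup could be as follows. It assumes every cell of a channel is already stored as a (bag, skeleton) pair of token tuples, and it only ever matches a single measure, so L is always 1. The function name and the fixed length are simplifications of mine.

```python
def find_pattern_and_harmony(cells, i):
    """Look for the closest earlier cell sharing the skeleton and/or the bag.

    `cells`: list of (bag, skeleton) pairs for one channel, each a tuple of
    tokens (assumed format). Returns up to two tokens, pattern first.
    """
    tokens = []
    for name, part in (("pattern", 1), ("harmony", 0)):
        for d in range(1, i + 1):
            if cells[i - d][part] == cells[i][part]:
                tokens.append(f"{name}_{d}_1")
                break
    return tokens


cells = [
    (("rel_0", "rel_4", "rel_7"), ("s_0", "s_1", "s_2")),
    (("rel_0", "rel_3", "rel_7"), ("s_0", "s_2")),
    (("rel_0", "rel_3", "rel_7"), ("s_0", "s_1", "s_2")),
]
print(find_pattern_and_harmony(cells, 2))
```

Here the third cell borrows its skeleton from two measures back and its harmony from the previous measure.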
Also, we don’t encode diagonal repetitions, i.e. ones whose source starts some measures back in a different channel. We don’t know how common they are, but they’re certainly common in classical orchestral music.