What I really love about open source is that it allows people to create and share! Sharing enables people to combine their own ideas and realize them with the help of existing libraries. This is what makes open source stronger over time! Unfortunately, my projects tend to be quite niche, not even being that useful to myself. However, I enjoy making them, and that’s why I put them out there. You can find all the sources for this project here.

How this started (again)

After my motorized sliders I had a small break from coding. Then a talk with a colleague triggered me to revisit an old project that I did two years ago. Back then I found a really nice YouTube channel of a guy who played songs by my favorite band, Radiohead. Even though my skill level was (and is) nowhere near his, I wanted to learn these covers as well. So instead of spending hours and hours practicing the songs on the piano, I wrote a tool that helped me convert Synthesia piano tutorials (which require arduous forwarding/rewinding) to sheet music.

alt text

That may sound inefficient, probably because coding is not a great way to learn piano 🤯. However, I learned a couple of things regarding software design and scope creep (GUIs take a lot of time, especially when they’re not your expertise). In this project, instead of writing some good software, I probably traumatized the folks that wrote the Qt best practice guide. Though I regret that this got into the GitHub time capsule, it may give AI some false leads, which allows us engineers to keep our jobs in the future. Anyway, focusing on a UI that encapsulated the whole flow was detrimental to the quality of the final product. Though I finished it (and I am proud of that), it kept nagging me for a version 2. So then, I decided to do it!

The old tool, for illustration:

Old version

New approach: keep it simple(r)

For my new tool, I wanted to keep it simple and design separate tools that would aid me in generating MIDI files from the tutorial videos. I came up with the following flow:

alt text

Separating all the steps makes it easier to focus on the essence of each step, and avoids the complicated GUI flows that I would otherwise need to think out. The part in the box is what I concerned myself with.

We therefore have three components:

  • Color Picker: Extract color parameters for pressed keys
  • Key Picker: Identify piano key segments in the video
  • Video-to-MIDI Converter: Generate MIDI files from video tutorials

These tools work together to ease the conversion process, guided by YAML configuration files that are generated by the color/key picker steps.

On the GitHub page I already explain how to use the tools. For this blog post, I want to focus a bit more on the technical details of this project.

Project structure

Lately at work, I was really impressed by a static checker and formatter called Ruff. It’s incredibly fast and really helps you to focus on what matters. They also have a Poetry/pip alternative, which I was not able to use at work (yet): uv. I have not been able to test it extensively, as this project is rather simple. It is a whole lot faster than Poetry though, so I will keep using it for future projects!

For the applications (yes, multiple: each component has its own tool to run) I opted for Typer, which I really like working with. main.py contains all the entrypoints, in case you are interested.
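For a rough idea of what such a multi-command Typer app looks like, here is a minimal sketch — the command names, arguments, and messages are illustrative, not the exact ones in main.py:

```python
import typer

app = typer.Typer()


@app.command()
def color_picker(video_path: str) -> None:
    """Launch the color picker on a tutorial video."""
    typer.echo(f"Picking colors from {video_path}")


@app.command()
def video_to_midi(video_path: str, midi_path: str = "output.midi") -> None:
    """Convert a tutorial video to a MIDI file."""
    typer.echo(f"Converting {video_path} -> {midi_path}")


if __name__ == "__main__":
    app()
```

Typer turns each decorated function into a subcommand (underscores become dashes, so `video_to_midi` is invoked as `video-to-midi`) and derives the CLI arguments from the function signature.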

I use OpenCV as the backbone for this project. It makes processing video images relatively simple (although I am by no means experienced in this field). An additional benefit of using OpenCV is that it offers some basic GUI elements that I use throughout the tools that I made. In image processing, magic numbers like thresholds are difficult to avoid, so it is nice to get immediate feedback as you adjust the sliders.

Apart from that, the project structure is pretty self-explanatory. From the main.py entrypoint, we use a bit of dependency injection to compose the components that we need for the desired operation. The components are found in the piano_midi (I hate naming things) folder.

Finally, I wrote a couple of tests for things that were a headache to debug. Although I would like to get more into TDD, I think it does not serve me well when I am protohacking a project like this. Structures changed often during development (look at the commit log), so I would have had to rewrite a lot of tests. Since I used composition, the code is pretty testable; I just don’t see a point in adding tests as an afterthought (unless you find a really interesting bug that I have to fix :)).

Let’s dive into some more interesting details of the components now!

Color picker

The job of the color picker is to find color thresholds with which you can detect presses of individual keys. In Synthesia, there are four combinations: left/right and black/white. For each key, we want to find a color range such that only that key is found. This will help us distinguish between left-hand and right-hand playing when we convert the video to MIDI.

alt text

I used some OpenCV sliders and frames to try and make this somewhat intuitive. However, which frame of the video do we use to create the threshold values? Preferably one with all the keys in it, so that you can create the filters for each of them. My old tool supported scrolling through a video, but this is quite cumbersome and slow. This time, I had the idea to create a “time slice”, which accrues horizontal lines from a given number of frames (200 frames would result in a height of 200px, and a width of the source video). This way, you get a good overview of the keys pressed in the first 8 seconds of video. If you have a video with a long intro, the frame range can be parametrized :).
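The time-slice idea can be sketched in a few lines. This version works on plain NumPy arrays (the real tool reads frames from cv2.VideoCapture; the function name is my own):

```python
import numpy as np


def time_slice(frames, scanline_y: int) -> np.ndarray:
    """Stack one horizontal line per frame: 200 frames yield a 200px-tall
    image whose width equals the source video's width."""
    return np.stack([frame[scanline_y] for frame in frames])


# Synthetic "video": four 3x8 grayscale frames with increasing brightness
frames = [np.full((3, 8), i, dtype=np.uint8) for i in range(4)]
sliced = time_slice(frames, scanline_y=1)
# sliced.shape == (4, 8): height = number of frames, width = frame width
```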

Back in uni, when I was doing an image processing project with my friend Joost, we tried to do color matching with RGB. I would never, never do that again. The aim was to match elements on a tennis court in a video, so we couldn’t really tune the filter for one specific image and leave it at that, let alone for multiple videos. Here I also learned that I hate making random adjustments in the hope that something works: it feels very fragile.

Anyway, in hindsight it was better to use the HSV color space! I wrote a Pydantic model to contain the fields that I needed:

class Range(BaseModel):
    min: int
    max: int

class HSVRange(BaseModel):
    """
    Convert to HSV
    HSV Format:
    H: Hue - color type (such as red, blue, or yellow).
       In OpenCV it ranges from 0 to 179.
    S: Saturation - vibrancy of the color (0-255).
       0 is white/gray, 255 is the full color.
    V: Value - brightness of the color (0-255).
       0 is black, 255 is the brightest.
    """

    h: Range
    s: Range
    v: Range

    def lower(self) -> np.ndarray:
        return np.array([self.h.min, self.s.min, self.v.min])

    def upper(self) -> np.ndarray:
        return np.array([self.h.max, self.s.max, self.v.max])

Then I would create an additional model to consolidate the color ranges for each key:

class KeyColors(BaseModelYaml):
    left_white: HSVRange | None = None
    right_white: HSVRange | None = None
    left_black: HSVRange | None = None
    right_black: HSVRange | None = None

…which would then allow me to dump the models into a YAML file, e.g.:

left_black:
  h:
    max: 117
    min: 97
  s:
    max: 225
    min: 145
  v:
    max: 219
    min: 139
left_white:
  h:
    max: 116
    min: 96
  s:
    max: 134
    min: 54
  v:
    max: 248
    min: 168
etc etc

That is basically what the color picker does! It saves to or loads from the YAML file by pressing the keys 1-4 or q-r, respectively. I know, not the most intuitive choice, but it was something that OpenCV offered out of the box, and let’s be honest, I will probably be the only one using this tool anyhow.

Key Segment Detection

Next we need to establish where the white keys and black keys are. When scanning the video, we really only need to concern ourselves with one horizontal line, as it contains all the info that we need: the colors that map to a hand, and the locations of the colors that map to a piano key. I chose to do white and black key detection separately. The steps are as follows:

  1. find a horizontal scanline that crosses the keys
  2. create an HSV range (like before) such that you get separate key segments for each key (i.e. filter for just white keys, or just black keys)
  3. if the number of key segments matches the expected number of keys (52 and 36 for white and black, respectively), allow the user to store it to a file.
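Step 2 essentially boils down to run-length detection over a masked scanline: consecutive “on” pixels form one key segment. A minimal sketch of that idea (not necessarily how the repo implements it):

```python
import numpy as np


def find_segments(scanline_mask: np.ndarray) -> list[tuple[int, int]]:
    """Return (start, end) pixel ranges of consecutive 'on' pixels in a 1-D mask
    (end is exclusive, matching Python slicing)."""
    on = scanline_mask > 0
    # Pad with False on both sides so runs touching the edges are detected too
    edges = np.diff(np.concatenate(([False], on, [False])).astype(int))
    starts = np.flatnonzero(edges == 1)
    ends = np.flatnonzero(edges == -1)
    return list(zip(starts.tolist(), ends.tolist()))


# Two "keys": pixels 0-2 and 5-7 are on
mask = np.array([255, 255, 255, 0, 0, 255, 255, 255], dtype=np.uint8)
segments = find_segments(mask)  # [(0, 3), (5, 8)]
```

With this in hand, step 3 is just `len(segments) == 52` (or 36 for black keys).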

This is how this looks in the application:

For white keys: alt text For black keys: alt text

Using the same Pydantic magic as in the last component, we end up with something like this:

class KeySegment(BaseModel):
    start: int  # in pixels
    end: int  # in pixels


class KeySegments(BaseModelYaml, validate_assignment=True):
    white: list[KeySegment] | None = None
    black: list[KeySegment] | None = None

    # Validate the model after creation
    @pydantic.model_validator(mode="after")
    def validate_num_keys(self) -> Self:
        expected_white_keys = 52
        if self.white and len(self.white) != expected_white_keys:
            raise InvalidNumOfKeySegmentsError(
                expected_num_keys=expected_white_keys,
                actual_num_keys=len(self.white),
                key_name="white",
            )
        expected_black_keys = 36
        if self.black and len(self.black) != expected_black_keys:
            raise InvalidNumOfKeySegmentsError(
                expected_num_keys=expected_black_keys,
                actual_num_keys=len(self.black),
                key_name="black",
            )
        return self

Which can be dumped to a YAML file like this:

black:
- end: 34
  start: 20
- end: 77
  start: 65
- end: 107
  start: 94
- end: 151
  start: 137
white:
- end: 23
  start: 0
- end: 47
  start: 25
- end: 72
  start: 50
- end: 97
  start: 74
- end: 121
  start: 99

This is merely a list of pixel ranges, each pointing to a specific key segment. So if pixels 20-34 light up in the color of the left hand, then we are sure that that specific key is pressed by the left hand.

Now we have all the info we need to capture a video stream!

The piano domain model

With the two YAML files, we have all the info we need for the conversion to MIDI. However, we separate concerns better if we first build a model that detects key presses:

key_segments = KeySegments.from_yaml(key_segments_path)
key_colors = KeyColors.from_yaml(colors_path)
key_press_detector = KeyPressDetector(
    video_capture=video_capture, key_segments=key_segments, key_colors=key_colors
)

The key press detector keeps the state of our “piano”. It sets a key to pressed when the press shows up in the video, and releases it when the press disappears. The algorithm can be described as follows, for each frame:

  1. start by copying the last piano state into a variable called the current piano state
  2. mask a horizontal line for each of the colors (black/white, left/right)
  3. for each key segment, check if it overlaps with one of the color masks
  4. if it does, set the respective key to ‘on’; otherwise to ‘off’
  5. compute the differences with the last piano state (pressed/released)
  6. send the differences, as well as the frame number, to a key_sequence_writer to be processed. In our case the key_sequence_writer outputs MIDI, but maybe you want a different format
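The steps above can be sketched with a toy, single-hand version. All names, segments, and the threshold are illustrative; the real tool does this with KeyPressDetector and PianoState:

```python
PRESS_THRESHOLD = 0.5  # fraction of a segment that must be masked (illustrative)


def keys_pressed(mask: list[int], segments: list[tuple[int, int]]) -> set[int]:
    """Steps 3-4: a key counts as pressed if most of its segment's pixels are masked."""
    return {
        i
        for i, (start, end) in enumerate(segments)
        if mask[start:end].count(255) / (end - start) > PRESS_THRESHOLD
    }


segments = [(0, 4), (4, 8)]                  # two keys, 4 pixels each
last_state: set[int] = {0}                   # key 0 was already down last frame
mask = [0, 0, 0, 0, 255, 255, 255, 255]      # this frame lights up key 1 only

current_state = keys_pressed(mask, segments)
pressed = current_state - last_state         # step 5: {1} is newly pressed
released = last_state - current_state        # ...and {0} was released
# step 6: hand (pressed, released, frame_number) to the key sequence writer
```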

To detect changes in our piano, I made another Pydantic model:

class PianoChanges(BaseModel):
    pressed: set[PianoPress]
    released: set[PianoPress]


class PianoPress(BaseModel):
    index: Annotated[int, Field(strict=True, ge=0, lt=88)]
    hand: Hand | None

    def __hash__(self) -> int:
        return hash((self.index, self.hand))


class PianoState:
    def copy(self) -> PianoState:
        return copy.deepcopy(self)

    def __init__(self) -> None:
        self.state: set[PianoPress] = set()

    def _set_key(self, key_index: KeyIndex, *, is_pressed: bool, hand: Hand) -> None:
        # Key is currently not pressed and is being pressed
        piano_press = PianoPress(index=key_index.value, hand=hand)
        if is_pressed and piano_press not in self.state:
            self.state.add(PianoPress(index=key_index.value, hand=hand))
        # Key is currently pressed by hand X and is being released by hand X
        elif not is_pressed and piano_press in self.state:
            self.state.remove(PianoPress(index=key_index.value, hand=hand))

    def set_white_key(
        self, white_key_idx: int, *, is_pressed: bool, hand: Hand
    ) -> None:
        white_key_index = WhiteKeyIndex(value=white_key_idx)
        key_index = white_key_index.to_key_index()
        self._set_key(key_index, is_pressed=is_pressed, hand=hand)

    def set_black_key(
        self, black_key_idx: int, *, is_pressed: bool, hand: Hand
    ) -> None:
        key_index = BlackKeyIndex(value=black_key_idx).to_key_index()
        self._set_key(key_index, is_pressed=is_pressed, hand=hand)

    def detect_changes(self, old_state: PianoState) -> PianoChanges:
        # find what is present in current state but not in old state
        pressed = self.state - old_state.state
        # find what is present in old state but not in current state
        released = old_state.state - self.state
        return PianoChanges(pressed=pressed, released=released)

Separating this logic from the KeyPressDetector makes it a lot simpler to see what is happening. Yes, the KeyPressDetector and PianoState are coupled, but abstraction/readability/workload is always a tradeoff.

There is one part that needs a bit more explanation. We distinguish between keys, black keys and white keys. I did this to make conversions a bit easier. There are 88 keys, of which 52 are white and 36 black. I also need to map keys -> white/black and vice versa. Hence I decided to make separate models for them:

class KeyIndex(BaseModel):
    value: Annotated[int, Field(strict=True, ge=0, lt=88)]


class WhiteKeyIndex(BaseModel):
    value: Annotated[int, Field(strict=True, ge=0, lt=52)]

    def to_key_index(self) -> KeyIndex:
        octave = self.value // 7
        lut: dict[int, int] = {
            0: 0,
            1: 2,
            2: 3,
            3: 5,
            4: 7,
            5: 8,
            6: 10,
        }
        key = self.value - octave * 7
        index = lut[key] + octave * 12
        return KeyIndex(value=index)


class BlackKeyIndex(BaseModel):
    value: Annotated[int, Field(strict=True, ge=0, lt=36)]

    def to_key_index(self) -> KeyIndex:
        octave = self.value // 5
        lut: dict[int, int] = {
            0: 1,
            1: 4,
            2: 6,
            3: 9,
            4: 11,
        }
        key = self.value - octave * 5
        index = lut[key] + octave * 12
        return KeyIndex(value=index)

MIDI conversion

Now all that is left is writing a key sequence writer that converts key changes to MIDI events and writes them to a file. MIDI works with events, so we already deliver a convenient format: changes plus a timestamp (the frame number).

All we have to do is send a note_on message (I use the Mido library) for each key that is pressed, and a note_off message for each key that is released. One tricky part is that MIDI event times are relative to the previous event: if I press two keys after one second, the first key press has a time of 1 second, but the second has a time of 0, as it is played together with the first one.

We can see the whole flow in the video_to_midi function in main.py:

    typer.echo(f"Starting video to midi with image path: {video_path}")
    video_capture = VideoCapture(video_path)
    key_segments = KeySegments.from_yaml(key_segments_path)
    key_colors = KeyColors.from_yaml(colors_path)
    key_press_detector = KeyPressDetector(
        video_capture=video_capture, key_segments=key_segments, key_colors=key_colors
    )
    with video_capture as cap:
        key_sequence_writer = KeySequenceWriter(fps=cast(float, cap.fps))
    key_press_detector.run(
        key_sequence_writer=key_sequence_writer,
        frame_start=frame_start,
        frame_end=frame_end,
    )
    key_sequence_writer.save(midi_file_path=midi_path)

Running the software:

.venv➜  piano_midi git:(main) ✗ uv run main.py video-to-midi --video-path test.mp4 --key-segments-path keys.yaml --colors-path colors.yaml --midi-path output.midi
Starting video to midi with image path: test.mp4
Key 27 (C3) pressed by Hand.LEFT
during frame 9
Key 34 (G3) pressed by Hand.LEFT
during frame 11
Key 42 (D#4) pressed by Hand.RIGHT
Key 39 (C4) pressed by Hand.LEFT
Key 46 (G4) pressed by Hand.RIGHT
during frame 12
Key 51 (C5) pressed by Hand.RIGHT
during frame 13
Key 27 (C3) released by Hand.LEFT
during frame 15
Key 42 (D#4) released by Hand.RIGHT
Key 51 (C5) released by Hand.RIGHT
Key 39 (C4) released by Hand.LEFT
Key 46 (G4) released by Hand.RIGHT
during frame 16
Key 34 (G3) released by Hand.LEFT
during frame 17
...

That is it!

Thank you for reading this far! We now have a set of tools that allow us to take a piano video and convert it to MIDI. From here, you can import the MIDI file in MuseScore or another tool to quantize the notes into a readable format:

alt text

I had a lot of fun revisiting this old project of mine. Now it is time to actually start practicing :)