ADC2018

When I kind of decided to leave my Xamarin development job and dive into the audio development world in early summer, I had one thing in mind: joining the Audio Developer Conference (ADC) 2018.

The two-day conference (three days if the training day is counted) is over, and I was quite happy there. There was an announcement about a new language called SOUL, the SOUnd Language. You can find the session recording on YouTube.

https://youtu.be/UHK1QQM0bnE?t=910

These days I have been checking various music- and audio-related software libraries and tools to write blog entries for my experimental audio tech/library/tools Advent Calendar in Japanese, and reading a handful of texts about this kind of stuff for it. I think I understand how awesome this SOUL stuff could be (as a skeptical developer I keep saying that it is still just an announcement) and thought I should describe why. This post is partly translated from what I had prepared in Japanese, even before ADC started.

I have to say, I have never been familiar with audio development. I joined ADC2018 without even knowing that it is actually mostly about JUCE. I kept telling friends things like "I will be joining ADC to possibly find my next job there", but I was not sure what it would be like. I was not even serious about attending at first ("I'll join it, but I'd quickly drop it if something more fun came up" kind of attitude). There are things that are likely obvious to the audio devs attending there that I didn't know, and I have never been a dedicated C++ programmer. But ADC was still fun, and I learned a lot.

So, what are APLs like?

SOUL is meant to be a new APL (audio programming language) and, beyond that, a platform. APLs are a family of languages, not a single language (and the acronym here does not mean "A Programming Language"). Think of the term like VPLs (visual programming languages).

They are designed to process audio and compose (or more precisely, "create"?) music, or to realize sound effects. Ideally, they are designed for non-programmers (or maybe "not very advanced" programmers).

They are actually language-based tools rather than language specifications. Examples of APLs are Csound, ChucK, Pure Data, Alda, Faust, and Tidal. What is interesting is that they always define their own languages. Admittedly, some of them are simply based on Lisp or Scheme and it is arguable whether they count as their own "language", but let's skip that for now; they ultimately end up with their own ecosystems.

An interesting aspect of those languages is that they never build on virtual machines like Java or .NET. Describing music or sound effects would be very useful especially in apps like games, and there are many apps and games written in Java or .NET. Why should we learn a completely different language? Can't they just be audio libraries, like raw-audio or raw-MIDI libs, so that we can arbitrarily manipulate those sound objects...?

However, those APL designers have reasons to do so. Interestingly, some of those reasons are described in academic papers. Among them I picked up an interesting one from Andrew Sorensen, regarding his language Extempore. It describes the problem from a very primitive level, examining every possibility to achieve minimal latency.

https://openresearch-repository.anu.edu.au/handle/1885/144603

Extempore is a language for "live coding" (he calls it a "cyber-physical" programming language). You can see it in action in this keynote recording from OSCON 2014:

https://www.youtube.com/watch?v=yY1FSsUV-8c

Audio processing use cases

Typical jobs that APLs deal with are about "realtime" processing of audio. What are such jobs, for example?

  • Think about a virtual piano app. You press a key, and the app reacts to it by sending a piano sound to the audio output. If you pressed a key and it took one second before you heard anything, you wouldn't want to use it.
  • MP3 players decode *.mp3 files and send the result to the audio output. Decoding usually takes less time than the raw PCM playtime, but if the output is flaky then the player is useless.
  • DAWs (digital audio workstations) are used to compose music in multiple tracks. Depending on the song, they might have to send audio outputs as well as MIDI outputs in sync. If they are out of sync, the resulting music does not sound right.
  • Think about a live performance using digital instruments (like in the OSCON keynote video), optionally with synchronized visuals. If the app freezes even for a very short time, like less than 0.1 sec, it is still problematic.

Those tasks might not all be "realtime" in the strict sense discussed here, but they are all under tight requirements on "precise" processing time.

Realtime requirement

So, it seems precise processing time is important. But how precise does it need to be? How "realtime" should it be? A 1-second delay is obviously wrong. 100 milliseconds? 10? 1? ... It's not obvious anymore. Similar realtime-ness is required for 3D animation frame rates in VR, e.g. 60fps or 120fps. Audio is similarly constrained; it is said that once latency reaches something like 20 milliseconds of delay, some people will notice.
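To put rough numbers on this (just back-of-the-envelope arithmetic of mine, nothing from the SOUL talk): the latency an audio callback contributes is roughly its buffer size divided by the sample rate, and that same figure is the deadline the callback has to meet on every single invocation. A tiny C# snippet makes it concrete:

    using System;

    // Back-of-the-envelope latency math for a typical audio callback.
    // 256 frames at 48000 Hz is about 5.3 ms per buffer; the callback
    // must deliver the next buffer within that window, every time.
    int sampleRate = 48000;
    int bufferFrames = 256;
    double bufferMs = bufferFrames * 1000.0 / sampleRate;
    Console.WriteLine($"buffer latency / deadline: {bufferMs:F2} ms"); // ~5.33 ms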

I should make one point clear: realtime processing is NOT about high-performance processing as in "the most efficient performance". The requirement is that a recurring task must always be invoked and complete within the expected time frame. In a sense it is the complete opposite of high-performance computing.

In modern general computing environments, there is preemptive multitasking (processes and/or threads) managed by the operating system. The Sorensen paper discusses other possibilities (manual, cooperative multitasking), and they are real in some embedded environments, but for APLs it is mostly a preemptive world. (Devices like ROLI Blocks and the Littlefoot compiler might be the other way.)

In any case, what is important here is that such a realtime app needs to run on a realtime thread (or process), which has to be supported by the OS. Ordinary threads have no guarantee of being scheduled in a "timely" manner.

And virtual machines have further problems. Namely, garbage collectors "stop the world" (including app threads) to collect unused memory blocks, and JIT compilers compile VM code to actual machine code at run time. Both result in unpredictable delays.
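To make the garbage collection point concrete, here is a tiny sketch of my own (not code from the paper or from any actual APL runtime) contrasting a naive audio render callback with a realtime-friendly one:

    using System;

    class RenderCallbackExample
    {
        const int BufferFrames = 256;

        // Naive version: allocating a new buffer on every callback creates
        // garbage, and the collector will eventually stop threads to clean
        // it up, exactly the kind of unpredictable pause described above.
        static float[] RenderNaive(Func<float> nextSample)
        {
            var buffer = new float[BufferFrames]; // allocation on the audio path
            for (int i = 0; i < buffer.Length; i++)
                buffer[i] = nextSample();
            return buffer;
        }

        // Realtime-friendly version: the buffer is allocated once up front,
        // and the callback only writes into preallocated memory.
        static readonly float[] reusedBuffer = new float[BufferFrames];

        static void RenderReusing(Func<float> nextSample)
        {
            for (int i = 0; i < reusedBuffer.Length; i++)
                reusedBuffer[i] = nextSample();
        }
    }

The second version never allocates on the audio path, which is the discipline realtime audio code has to follow even when the language would happily allocate for you.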

Sorensen mentions Sonic Pi on Ruby, Gibber on JavaScript, Impromptu on Scheme, and Overtone on Clojure as examples of environments built on such managed runtimes.

Therefore, languages like Extempore avoid those problems by designing their own language that generates native code statically and requires explicit memory allocation. These days LLVM IR is the common code-generation solution for them. Probably Lisp/Scheme parsers were a convenient frontend solution too, but that's just my guess.

As of 2018, we virtual-machine-based developers would have some words about that premise (e.g. we have AOT solutions), but let's put that aside. Today I'm more interested in what SOUL provides.

Language runtime and sound server communication

One interesting point that Sorensen made in that paper was the contrast between a full-stack and a half-stack solution for an APL: it is still possible to build a client-server system, and it can be either intra-process or inter-process. So, instead of implementing everything in one single language and framework, it is possible to use multiple languages and frameworks, implementing the parts each is suited for and cooperating together.

In an intra-process design, there is a realtime sound server implementation and a client language bridged by FFI; according to the paper, ChucK, Impromptu, and Fluxus are based on that model. For the inter-process model, SuperCollider is an example. Extempore is based on the inter-process model too, at this stage (the paper says it has been in between those two models).
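As a concrete illustration of the inter-process model, the client side can be as simple as serializing commands and shipping them to a sound server process. This is a hypothetical sketch of my own; the command format and port are made up and are not SuperCollider's or Extempore's actual protocol:

    using System.Net.Sockets;
    using System.Text;

    // Hypothetical client side of an inter-process APL setup: this process
    // never touches the realtime audio thread; it only sends commands to a
    // separate sound-server process that does the realtime work.
    class SoundServerClient
    {
        readonly UdpClient udp = new UdpClient();

        public SoundServerClient(string host = "127.0.0.1", int port = 9000)
            => udp.Connect(host, port);

        // A made-up textual command format, purely for illustration.
        public void NoteOn(int channel, int note, int velocity)
            => Send($"noteon {channel} {note} {velocity}");

        public void NoteOff(int channel, int note)
            => Send($"noteoff {channel} {note}");

        void Send(string command)
        {
            byte[] payload = Encoding.UTF8.GetBytes(command);
            udp.Send(payload, payload.Length);
        }
    }

Whether the boundary is a socket (inter-process) or an FFI call into an in-process sound server, the point is the same: the language the user types in does not have to be the one that meets the realtime deadline.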

The language barriers

Leaving the Sorensen paper aside, I have always been interested in having some sound object model that can be manipulated in C# and Mono. I had almost no interest in whatever the .NET Windows people had created; Windows-only stuff will just die soon. I'm only interested in cross-platform solutions.

There are still some hackers who have created portaudio or rtaudio bindings. I have my own binding for libsoundio, but other people have done that too. There is the Web Audio API in the JavaScript world, and at ADC2018 there was a session about bringing an audio API into the C++ standard ("std::audio"). There is no such thing in .NET; it is years behind C++ or the Web.
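For a flavor of what such a raw audio binding means on the .NET side, here is a minimal P/Invoke sketch against libsoundio. The entry points are real libsoundio functions, but the library name resolution and error handling are simplified, and a usable binding needs far more than this (device structs, stream callbacks, and so on):

    using System;
    using System.Runtime.InteropServices;

    static class SoundIoNative
    {
        const string Lib = "soundio"; // libsoundio.so / libsoundio.dylib / soundio.dll

        [DllImport(Lib)] public static extern IntPtr soundio_create();
        [DllImport(Lib)] public static extern int soundio_connect(IntPtr soundio);
        [DllImport(Lib)] public static extern void soundio_flush_events(IntPtr soundio);
        [DllImport(Lib)] public static extern int soundio_default_output_device_index(IntPtr soundio);
        [DllImport(Lib)] public static extern void soundio_destroy(IntPtr soundio);
    }

    class Program
    {
        static void Main()
        {
            IntPtr ctx = SoundIoNative.soundio_create();
            if (SoundIoNative.soundio_connect(ctx) != 0)
                throw new InvalidOperationException("could not connect to an audio backend");
            SoundIoNative.soundio_flush_events(ctx);
            int device = SoundIoNative.soundio_default_output_device_index(ctx);
            Console.WriteLine($"default output device index: {device}");
            SoundIoNative.soundio_destroy(ctx);
        }
    }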

Anyhow, as the next step beyond raw audio, what I wanted was some common sound object model that those APLs could possibly share. Right now they are living around a tower of Babel, but some common ground could help improve the situation towards a cross-language solution.

Then I could instantiate audio output tracks and play them in a timely manner, coordinating them all, just like DAWs do. I have my own MIDI player and even a text macro compiler that generates MIDI files, so I had some foundation.

I had a look at a couple of APLs, and I always ended up finding no general music solution there. Their outputs are specific to certain genres of music. It's not what I had seen back in the 20th century in Japan, where people used FM synthesizers and MIDI instruments for all kinds of music, even with limited expressiveness.

My understanding is that a single language is unlikely to be good for everyone. Instead there could be a common sound platform shared among whatever new languages come up.

Another concern I had was that I should probably write everything in C (or at least provide a C API). My primary language has been C# (although I'm practically free to choose any language now that I'm independent) and C# still has many developers, but if it comes with technical difficulties (like the ones discussed above) then it is simply a bad technology choice.

The client-server model in that Sorensen paper was then likely the next thing I could try, i.e. design a primitive sound object system and manipulate it privately via C#.

To be honest, I don't really care much about audio latency. But I would care if I were going to perform some sort of live coding, and that's still a possible future.

Where SOUL comes in

The SOUL announcement appeared just when my ideas were at that stage.

soul-worst-case-scenario

(I have no image storage, so I put them on Google Photos and you have to navigate there to see them...)

This is a screenshot from the video stream. You can use any language to write a program that drives the SOUL API, which is likely something like OpenGL shaders. The SOUL language sources are compiled into native code via LLVM IR and run by a JIT.
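No SOUL API has actually been published at this point, so the following is purely my own guesswork written as C# stubs; every type and method name here is made up. It only illustrates the shader-like workflow implied in the talk: the host hands a SOUL source string to a compiler at run time, gets a processor back, and leaves the realtime execution to the runtime (or, later, a driver or device):

    using System;

    // Hypothetical stubs only; there is no published SOUL API yet.
    interface ISoulProcessor { }

    class HypotheticalSoulCompiler
    {
        // In the announced design this step would go through LLVM IR and a JIT.
        public ISoulProcessor Compile(string soulSource) => new StubProcessor();

        class StubProcessor : ISoulProcessor { }
    }

    class HostSketch
    {
        static void Main()
        {
            // The host language (C# here) never runs the DSP itself; it only
            // submits SOUL source, shader-style, and wires up the result.
            string soulSource = "/* SOUL processor source would go here */";
            ISoulProcessor processor = new HypotheticalSoulCompiler().Compile(soulSource);
            Console.WriteLine($"compiled into: {processor.GetType().Name}");
            // A real host would now hand the processor to the SOUL runtime,
            // which owns the realtime thread (or hardware) that executes it.
        }
    }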

What's good here is that the sound platform itself does not give up on audio latency. It actually aims to provide a better solution, by opening up the chance of hardware-accelerated processing of the sound platform:

soul-hardware-accelerated-case-scenario

Hardware/driver-accelerated programs run with better (lower) latency. (Of course, there are business opportunities that the SOUL language developers could aim at here.)

Yes, JIT is something that those APL designers worry about, or so I assume in theory. The solution here is similar to how shaders are compiled for GPUs. GPUs are popular, but we would mostly be skeptical about such dedicated SOUL devices. That may or may not be just a matter of time; I have no idea right now.

Thoughts

My concern about SOUL is mostly the compilation pipeline, especially how quick the turnaround time would be. If generating a piano sound from a keypress is noticeably slow, then it may not be a solution.

Extempore, as a live-coding-ready language, compiles its xtlang sources into LLVM IR so that when they are invoked they are executed instantly. But that still involves compilation, which the language users would trigger ahead of time from their editors. Compilation should still take some time, unless the language itself is not very expressive.

SOUL itself is not likely to be highly expressive (it does not seem to aim to be a universal language), so the client languages (for me, maybe C#) would have to generate fairly complicated code.

But since there is no actual piece of code out yet, we'll see what happens. Even if it turned out to be vaporware (which I guess is highly unlikely), it is still a great idea that someone could yet accomplish.