We now subscribe to an event and wait for frames when capturing audio
input, the big avdantage is that we only have to fetch the frames from
the hardware once, and we don't need to cache anything.
The frames are simply dispatched to the client's callbacks immediately.
Also removes some outdated ifdefs that did not apply anymore.
And properly handle toxav happily delivering things out of order,
like firing a video frame callback right after a callback setting the bitrate to 0,
when the peer sent these commands in the right order