Reliability of AudioChannel.PositionMs

Started by Sakspapir, Mon 20/03/2023 23:01:31


Sakspapir

I'm wondering how reliable the above-mentioned property is.

I want to start playing a conversational audio clip and read AudioChannel.PositionMs continuously (every game cycle) to see how far the clip has played. Based on the return value, I will display subtitles using a GUI label. The game runs at 60 fps, so each game cycle is approximately 16 ms. By checking whether PositionMs returns an int larger than some predefined value, I would get a precision no worse than about 16 ms, which is acceptable.
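A minimal sketch of the lookup described above, written in Python rather than AGS script so it can be run anywhere. The cue table, the subtitle texts and the `subtitle_at` helper are hypothetical names made up for illustration, and the loop only simulates reading the channel position once per game cycle; it is not how AGS itself is scripted.

```python
from bisect import bisect_right

# Hypothetical cue table: (start time in ms, subtitle text), sorted by start time.
CUES = [
    (0, "Nice weather today."),
    (2400, "Did you hear about the harbour?"),
    (5000, "They say the ferry is late again."),
]

def subtitle_at(position_ms, cues=CUES):
    """Return the subtitle whose start time is the latest one <= position_ms."""
    starts = [start for start, _ in cues]
    index = bisect_right(starts, position_ms) - 1
    return cues[index][1] if index >= 0 else ""

# Simulate polling once per game cycle at 60 fps (about 16.7 ms per cycle).
# The computed value is a stand-in for reading the position from the channel.
shown_at = {}
for frame in range(400):
    position_ms = frame * 1000 // 60
    text = subtitle_at(position_ms)
    # Remember the first position at which each subtitle became visible.
    shown_at.setdefault(text, position_ms)
```

Because the subtitle is chosen from the position alone, a late poll only delays a subtitle by at most one game cycle; it does not shift the subtitles that follow.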

I'm using mp3 files for the audio.

Anyway, I just want to be sure that, for instance, reading a value of 5000 milliseconds from PositionMs always refers to exactly the same point in the mp3 file. Otherwise there might be an accumulating offset between the subtitles and the voices, which is not good enough.

Thanks in advance!

eri0o

Don't use mp3, use ogg instead. If I remember correctly, ogg files have a fixed bitrate and mp3 files can have a variable bitrate; when the mp3 has a variable bitrate, I think its length can come out wrong, which results in a gap. The position in ms should work in both cases, though.

Ogg also compresses voice better than mp3, so you get a smaller file size too; I think it's always best to use ogg files. Really, mp3 is not a good file format, and you should avoid it if you can.

If syncing is a must, and it's only voice, you can use Audacity to label your audio source and export the audio in sliced pieces. In Audacity I also recommend adding a fade in and fade out at the beginning and end of each sentence to avoid a puff sound. Done this way, you can later run the clips in a dialog system, so that the text doesn't advance by itself and the player can press a key to advance it.

Sakspapir

Thanks for the reply!

The Audacity approach you mentioned is exactly what I have been doing. I wanted to avoid doing that, since it results in tons of small audio files. It's a perfectly good way of solving my issue, but if I could reliably use the PositionMs property, I could just note down the timestamps at the start of each sentence. I think this would save me a lot of time and file handling.

Quote from: eri0o on Mon 20/03/2023 23:49:11
If I remember correctly ogg files have a fixed bitrate and mp3 files can have a variable bit rate, when the mp3 has a variable bitrate its length I think can result in a gap

I did some googling, and it seems it's actually the other way around. In the Audacity manual, it says:
Quote
OGG encoding is technically a form of VBR or variable bit rate encoding, as opposed to the CBR or constant bit rate encoding used by default when exporting MP3 files.

I guess the root of my question is how does PositionMs actually work? Is it something like this:

a) AGS starts playing a clip and starts some internal timer mechanism. When PositionMs is read, it checks the timer and returns the value.

b) AGS reads out the time data directly from the sound file itself.

For my use, it must function something like alternative b); otherwise the timing might be off.


eri0o

I assure you that ogg is the right format for things to work correctly; ogg has a reference implementation from Xiph, and the files are handled well.

The length is not calculated correctly for all mp3 files.

https://github.com/icculus/SDL_sound/commit/495e948b455af48eb45f75cccc060498f1e0e8a2

It assumes all frames are the same size.

Anyway, this is about LENGTH. I am mentioning it because looping music in mp3 will fail because of this and will have a gap. Also, mp3 never beats ogg in the encoding quality/size tradeoff: the mp3 files are always bigger and sound worse.
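To illustrate why assuming equally sized frames gives a wrong length for variable-bitrate files, here is a rough sketch in Python. It is not taken from the linked commit or from any decoder; the function names and the example bitrates are made up, and frame padding and headers are ignored. It relies only on the fact that an MPEG-1 Layer III frame always decodes to 1152 samples, while its size in bytes depends on the bitrate.

```python
SAMPLES_PER_FRAME = 1152  # samples decoded from one MPEG-1 Layer III frame
SAMPLE_RATE = 44100       # assumed sample rate in Hz

def frame_bytes(bitrate_kbps, sample_rate=SAMPLE_RATE):
    """Approximate size in bytes of one frame at this bitrate (padding ignored)."""
    return 144 * bitrate_kbps * 1000 // sample_rate

def true_duration_ms(frame_bitrates):
    """Every frame holds the same number of samples, whatever its bitrate."""
    return len(frame_bitrates) * SAMPLES_PER_FRAME * 1000 / SAMPLE_RATE

def naive_duration_ms(frame_bitrates):
    """Estimate that assumes every frame is as large as the first one."""
    total_bytes = sum(frame_bytes(b) for b in frame_bitrates)
    assumed_frames = total_bytes / frame_bytes(frame_bitrates[0])
    return assumed_frames * SAMPLES_PER_FRAME * 1000 / SAMPLE_RATE

cbr = [128] * 1000               # constant bitrate: 1000 frames at 128 kbps
vbr = [128] * 500 + [64] * 500   # variable bitrate: half the frames at 64 kbps

print(true_duration_ms(cbr), naive_duration_ms(cbr))
print(true_duration_ms(vbr), naive_duration_ms(vbr))
```

For the constant-bitrate list the two figures agree; for the variable-bitrate list the naive estimate comes out shorter than the real playing time, which is the kind of mismatch that would show up as a gap when looping.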

The sound file doesn't have time data in the way you are mentioning.

Sound plays in its own thread, so it doesn't desync, and the timing in ms is correct in my tests.

Anyway, I still recommend labeling in Audacity, exporting the individual sound files and using a dialog system, as text is usually hard to read in a hurry.

Khris

Quote from: Sakspapir on Mon 20/03/2023 23:01:31
I'm wondering how reliable the above mentioned function is.

Couldn't you simply try it with a, say, two minute audio file? Schedule something for a specific position near the end and see what happens? That should be quicker than us guessing around, and I'd also guess that if the function isn't perfectly reliable, it's not something that can be fixed in the engine code easily because otherwise it would've been implemented like that already.

Looking for solutions or alternatives only makes sense once it's established that the function isn't reliable.

Sakspapir

Quote from: eri0o on Tue 21/03/2023 08:59:20
The sound file doesn't have time data in the way you are mentioning.

Sound plays in its own thread so it doesn't desync and the timing in MS is correct in my tests.

Ok, this sounds promising. Thank you! Also, I will probably change to OGG, I have no doubt it would be a better choice than mp3 if the variable bitrate won't affect AudioChannel.Seek and AudioChannel.PositionMs.

Quote from: Khris on Tue 21/03/2023 08:59:45
Couldn't you simply try it with a, say, two minute audio file? Schedule something for a specific position near the end and see what happens? That should be quicker than us guessing around, and I'd also guess that if the function isn't perfectly reliable, it's not something that can be fixed in the engine code easily because otherwise it would've been implemented like that already.

Looking for solutions or alternatives only makes sense once it's established that the function isn't reliable.

I'm not really looking for guesses, but rather for an explanation of how AudioChannel.PositionMs actually works. I could run a lot of tests, but I still wouldn't be completely sure, and I wouldn't know how it behaves on a different setup (different computer, Windows version, etc.).

Crimson Wizard

PositionMs is calculated based on sound data, not timer. The decoder converts every supported input format to a uniform sound wave format, after which the current position may be calculated, roughly speaking, as (DATA_SIZE_PLAYED / DATA_SIZE_PER_SECOND).

Certain lags cannot be avoided completely, as there's a difference in time between audio playing and game updates, but this difference should not accumulate and would rather stay in some range.
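The rough formula above can be written out as a small Python sketch. The function name and the default format (44.1 kHz, stereo, 16-bit samples) are assumptions for illustration only; the actual engine code and its internal output format may differ.

```python
def position_ms(bytes_played, sample_rate=44100, channels=2, bytes_per_sample=2):
    """Playback position derived purely from the amount of decoded data played.

    DATA_SIZE_PER_SECOND is fixed once the audio is decoded to a uniform
    format, so the same byte count always maps to the same position,
    regardless of how the source file was compressed.
    """
    bytes_per_second = sample_rate * channels * bytes_per_sample
    return bytes_played * 1000 // bytes_per_second

print(position_ms(882000))  # 5 seconds of 44.1 kHz stereo 16-bit audio
```

Since no wall-clock timer is involved in this calculation, a slow or late game update only affects when the value is read, not the value itself.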



That said, as eri0o mentioned above, in adventure games conversations are usually split into pieces for the convenience of the player. Not every player can understand the voice well (if it's not their native language), or read the text as fast as you expect. For that reason it may be more convenient to have a pause between "sections", during which the text stays on screen. This behavior may also be made configurable: whether a conversation continues automatically or waits for the player's input, how long it waits, and so forth. I thought I'd mention this too, just in case.

Sakspapir

Quote from: Crimson Wizard on Tue 21/03/2023 10:08:51
PositionMs is calculated based on sound data, not timer. The decoder converts every supported input format to a uniform sound wave format, after which the current position may be calculated, roughly speaking, as (DATA_SIZE_PLAYED / DATA_SIZE_PER_SECOND).

Certain lags cannot be avoided completely, as there's a difference in time between audio playing and game updates, but this difference should not accumulate and stay in some range.

Thank you very much, this is exactly what I was hoping would be the case. The lag you mention is negligible, and as long as it doesn't accumulate there's no issue.

Quote from: Crimson Wizard on Tue 21/03/2023 10:08:51
That said, as eri0o mentioned above, in adventure games the conversations are usually done split into pieces for the convenience of a player. Not every player can understand voice well (if it's not their native language), or read the text as fast as you expected. For that reason it may be more convenient to have a pause in between "sections", where text still stays on screen. This behavior may also be configured: whether a conversation continues automatically or awaits for player's input, how long it waits, and so forth. I thought I'd mention this too just in case.

I completely agree with this. However, the functionality I'm implementing is kind of a "background" thing. It is sort of idle chit-chat that the player can choose to ignore.

Finally, thanks a lot to everyone for taking their time to answer my questions.

Snarky

Quote from: Sakspapir on Tue 21/03/2023 10:26:28
I completely agree with this. However, the functionality I'm implementing is kind of a "background" thing. It is sort of idle chit-chat that the player can choose to ignore.

Even so. There are a lot of benefits to splitting it up and keeping a one-to-one relationship between a line of text dialog and an audio clip. For example, if you were to translate it, you could do so without having to change all the subtitle timings. And you wouldn't have to write a whole bunch of special-purpose code to time the subtitles to the audio. (Though admittedly you would need to write code, or use a pre-written module, to provide queued background speech with voice.) And you would be easily able to vary it more dynamically; so that for example if the player leaves the room and comes back, it resumes at the next line (if you want); or if you try to talk to the person speaking in the background, you can interrupt them, and then they'll be able to resume afterwards with some kind of transition, like "What was I saying? Oh yes..."

Sakspapir

Quote from: Snarky on Tue 21/03/2023 11:56:26
Even so. There are a lot of benefits to splitting it up and keeping a one-to-one relationship between a line of text dialog and an audio clip. For example, if you were to translate it, you could do so without having to change all the subtitle timings. And you wouldn't have to write a whole bunch of special-purpose code to time the subtitles to the audio. (Though admittedly you would need to write code, or use a pre-written module, to provide queued background speech with voice.) And you would be easily able to vary it more dynamically; so that for example if the player leaves the room and comes back, it resumes at the next line (if you want); or if you try to talk to the person speaking in the background, you can interrupt them, and then they'll be able to resume afterwards with some kind of transition, like "What was I saying? Oh yes..."

Noted. The point with resuming conversations is especially good, I think. However, I could easily get around this by saving the PositionMs of the audio clip. When the player returns, a different clip plays with "What was I..." and then I can resume playing the longer clip from the saved ms position.

Also, if I were to translate the game (which I will) I would have to create all the voice again, divide it into sentences, convert it into separate clips (though an easy feat with e.g. Audacity) and then import all the voice files, make sure they play in the right order etc. I've been down that road, and it felt very tiresome, so I want to try this approach.

I could even let the player click through the conversation at his/her desired pace, if I just stop the audio clip at the predefined ms values and then resume when a mouse click is registered. I can use the same technique to implement a typical text speed parameter, as in the old LucasArts games.
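A minimal sketch of that stop-and-resume logic, again in Python rather than AGS script. The stop points, the `next_stop` helper and the `ClickThroughPlayer` class are all hypothetical; the class only models the decisions (when to pause, where to continue) and does not play or pause any audio itself.

```python
from bisect import bisect_right

# Hypothetical sentence boundaries in ms, sorted ascending.
STOPS = [2400, 5000, 8200]

def next_stop(position_ms, stops=STOPS):
    """Return the first stop point after position_ms, or None if there is none."""
    index = bisect_right(stops, position_ms)
    return stops[index] if index < len(stops) else None

class ClickThroughPlayer:
    """Tiny model of the pause/resume decisions; it does not touch any audio."""

    def __init__(self, stops=STOPS):
        self.stops = stops
        self.position_ms = 0   # last position seen; doubles as the saved position
        self.paused = False
        self.target = next_stop(0, stops)

    def tick(self, new_position_ms):
        """Call once per game cycle with the position read from the channel."""
        if self.paused:
            return
        self.position_ms = new_position_ms
        if self.target is not None and new_position_ms >= self.target:
            self.paused = True  # the game would stop the clip here

    def click(self):
        """Call on a mouse click; continue towards the following stop point."""
        if self.paused:
            self.paused = False
            self.target = next_stop(self.position_ms, self.stops)

player = ClickThroughPlayer()
player.tick(2390)   # still before the first stop point
player.tick(2407)   # first poll past 2400 ms: the model pauses here
```

The stored `position_ms` is also what the game would keep if the player leaves mid-conversation, so that playback can later be resumed from that saved position.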

