One of my goals for 2019 is to upload YouTube videos regularly. Last week I hit an accidental stumbling block: the updates and adjustments I usually apply managed to break my Windows install, and the drivers stopped working correctly.

There is no need to worry about the repairs. I now have a fresh install with the latest drivers, as I am supposed to have. The unfortunate side is that I spent some time recording 2 videos last week, at a point when my computer thought my microphone was a myth.

That left me needing to do something with the pre-recorded videos. Sure, I could record them again, as I save after every episode. That doesn’t feel right, though; I would rather spend the recording time only once.

As you will no doubt see in this past weekend’s edg3 za episode, I don’t mind errors. They can be amusing, so hopefully I can bring a giggle or two.

This also gave me a few other ideas to step into. This post is a small intro to the systems I worked with for these improvements: I share what I saw and what I did to get everything I wanted out of this week’s efforts.

Start Testing SpeechSynthesiser

As you may or may not know, Microsoft provides libraries for working with speech. They also cover recognition, but for now we will just take a look at the speech synthesiser, which is essentially what we can use to make our computer talk to us.

First, we create a console application for the test and add a reference to System.Speech (by Microsoft) to the project. Essentially, this documents what I look at, what I test, and what I figure will make the code actually talk to us.

        static void Main(string[] args)
        {
            System.Speech.Synthesis.Prompt prm = new Prompt("Hello, write a line to have me talk.");
            System.Console.ReadLine();
        }

After looking at some of the code, I figure we most likely need a Prompt. It does not have anything like “say”, so we need to look around for that. Looking at the Synthesis objects, I figure it must be the SpeechSynthesizer itself.

        System.Speech.Synthesis.SpeechSynthesizer speechSynthesizer = new SpeechSynthesizer();
        speechSynthesizer.SelectVoice("");

As you can tell, we don’t yet know the name of the voice we should select. I looked it up, and there is a method called GetInstalledVoices.
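A minimal sketch of that lookup, assuming the project already references System.Speech. It lists each installed voice with its culture and whether it is enabled. The class name ListVoices is only for illustration.

```csharp
using System;
using System.Speech.Synthesis;

class ListVoices
{
    static void Main()
    {
        using (var synth = new SpeechSynthesizer())
        {
            // GetInstalledVoices returns every voice the synthesiser can find on this machine.
            foreach (InstalledVoice voice in synth.GetInstalledVoices())
            {
                VoiceInfo info = voice.VoiceInfo;
                Console.WriteLine("{0} ({1}), enabled: {2}", info.Name, info.Culture, voice.Enabled);
            }
        }
    }
}
```

The Name printed here is the string that SelectVoice expects.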

As you can see, the assumption was half correct. We have 2 languages installed, and both are enabled. Awesome. All we needed to swap was one small piece, and then it compiles and runs.

        speechSynthesizer.SelectVoice(col[0].VoiceInfo.Name);

Take note, eventually we can pay attention to the languages and how they sound. For now we aren’t worried; we just need the name for the first step. I looked at the available methods and found an easy option (since we already made the Prompt).

        speechSynthesizer.Speak(prm);

Now my console application talks to me, so let us start testing. I make a loop that reads input; the input “qqq” stops the loop.

        string line = "";
        while (true)
        {
            line = Console.ReadLine();
            if (line == null || line == "qqq") break; // stop before speaking the exit word
            prm = new Prompt(line);
            speechSynthesizer.Speak(prm); // Speak is synchronous: it returns once the line has been said
        }

I figured it is best to see whether the loop pauses while a line is being spoken. It does, since Speak only returns once it has finished talking, and that works perfectly for us. We can definitely use this to make the computer read a string to us. We will have to do more; it just adds slight difficulties to our further implementation.

        static void Main(string[] args)
        {
            Prompt prm = new Prompt("Hello, write a line to have me talk.");
            SpeechSynthesizer speechSynthesizer = new SpeechSynthesizer();
            var col = speechSynthesizer.GetInstalledVoices();
            speechSynthesizer.SelectVoice(col[0].VoiceInfo.Name);
            speechSynthesizer.Speak(prm);

            string line = "";
            while (true)
            {
                line = Console.ReadLine();
                if (line == null || line == "qqq") break; // stop before speaking the exit word
                prm = new Prompt(line);
                speechSynthesizer.Speak(prm);
            }

            System.Console.ReadLine();
        }

This is minimal, and I know my naming may be slightly weird. My conventions are strange; however, this is just a test. Since what matters actually works, we can keep things simple like this.

Converting to WAV

I was starting on the next step, but realised we first need to work out how to save the speech as a WAV file, which is the simplest thing to add to the videos I already recorded.

It turns out we only need a minimal addition to the code above.

        speechSynthesizer.SetOutputToWaveFile("output.wav");

Oh dear, this may cause worry. It worried me for a second: the application no longer talked back when we sent lines of text, because the speech now goes to the file instead of the speakers. Once I realised this, I typed in some weird things and then went to check. It is true, the WAV file exists. This worked perfectly for me.

This made me think of one other, slightly unfortunate thing. I could Google it, but I wanted to stick to whatever I can find myself: I haven’t worked out how to pause the voice output. So we will essentially make one paragraph of everything that is said, split it up in Shotcut, and place the sentences at the right spots in the video.
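One option that might be worth testing for the pauses is PromptBuilder, which System.Speech provides for building a prompt piece by piece. The sketch below is untested against these videos and only an assumption about how it could fit in: it adds a break after each sentence before everything is written to output.wav. The sentences, the two-second pause and the class name PausedSpeech are placeholders.

```csharp
using System;
using System.Speech.Synthesis;

class PausedSpeech
{
    static void Main()
    {
        // Placeholder sentences; in practice these would come from the script for the video.
        string[] sentences = { "I am an ostrich.", "Only on Tuesdays." };

        var builder = new PromptBuilder();
        foreach (string sentence in sentences)
        {
            builder.AppendText(sentence);
            // Insert two seconds of silence after each sentence (the length is an arbitrary choice).
            builder.AppendBreak(TimeSpan.FromSeconds(2));
        }

        using (var synth = new SpeechSynthesizer())
        {
            synth.SetOutputToWaveFile("output.wav");
            synth.Speak(builder);
        }
    }
}
```

Splitting in Shotcut would still work the same way; the silences would just make the cut points easier to find.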

Factorio Talk

This is for the 2 Factorio episodes I pre-recorded last week. On that day I did not realise the trouble I had. Now, take note, this could likely be done differently.

        if (args.Length == 0)
        {
            Console.WriteLine("Please start using the correct format:");
            Console.WriteLine(" .\\YouTubeSpeech.exe ");
            Console.WriteLine();
            return;
        }

I figured I wanted to make it a terminal application. Keep it small, keep it simple. That can start with an initial for loop over the lines of the file.

        var f = File.ReadAllLines(args[0]);
        for (int i = 0; i < f.Length; i++)
        {
            var line = f[i];
        }

Putting the tested speech code in, it should be simple to understand.

        static void Main(string[] args)
        {
            if (args.Length == 0)
            {
                Console.WriteLine("Please start using the correct format:");
                Console.WriteLine(" .\\YouTubeSpeech.exe ");
                Console.WriteLine();
                return;
            }

            SpeechSynthesizer speechSynthesizer = new SpeechSynthesizer();
            var col = speechSynthesizer.GetInstalledVoices();
            speechSynthesizer.SelectVoice(col[0].VoiceInfo.Name);
            speechSynthesizer.SetOutputToWaveFile("output.wav");
            Prompt prm;

            var f = File.ReadAllLines(args[0]);
            for (int i = 0; i < f.Length; i++)
            {
                var line = f[i];
                prm = new Prompt(line);
                speechSynthesizer.Speak(prm);
            }

            Console.WriteLine("Complete.");
            Console.ReadLine();
        }

As simple as that, now it is time to test it out. Let us see what we get from this. Testing it with a simple set, it works out.

	I am an ostrich.
	Only on tuesdays.
	What if I can talk?

I do hear some stuttering, so the sound isn't perfect. I just want to remove the stutter the easiest way I can: noise cancellation in Audacity. After that the audio still isn't completely perfect; however, I can now use it for the videos without audio.

The Next Step

Well, I now have a way to talk in videos if I ever can't talk in them again. That is awesome, so we can move on to the next idea I have. It doesn't need to be something that is added to my recordings, don't get me wrong. I just always promise things, suggest things, and say things, and I want to hold myself to them.

What This Brings

It is nice to step into resources I know we already have; that makes this very simple, and it can easily be used in several ways. So, we will step into voice recognition for ourselves. The documentation clearly shows everything we will need.

        static void Main(string[] args)
        {
            using (SpeechRecognitionEngine engine = new SpeechRecognitionEngine(new System.Globalization.CultureInfo("en-ZA")))
            {
                engine.LoadGrammar(new DictationGrammar());

                engine.SpeechRecognized += Engine_SpeechRecognized;
                engine.SetInputToDefaultAudioDevice();
                engine.RecognizeAsync(RecognizeMode.Multiple);

                while (Console.ReadLine() != "qqq")
                {

                }
            }
        }

        private static void Engine_SpeechRecognized(object sender, SpeechRecognizedEventArgs e)
        {
            throw new NotImplementedException();
        }

As you can see, we started from the documentation and made a few personal adjustments. I wanted to call the engine an actual "engine". Similarly, I wanted South African English, which I figured was "en-ZA". I suppose I should look it up; I just assumed so, as it makes sense logically.
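One way to check that assumption is to ask which recognisers are actually installed. This is a small sketch, assuming a reference to System.Speech; the class name ListRecognizers is only illustrative. According to the documentation, the SpeechRecognitionEngine constructor throws an ArgumentException when no installed recogniser supports the requested culture.

```csharp
using System;
using System.Speech.Recognition;

class ListRecognizers
{
    static void Main()
    {
        // InstalledRecognizers is a static method, so no engine instance is needed.
        foreach (RecognizerInfo info in SpeechRecognitionEngine.InstalledRecognizers())
        {
            Console.WriteLine("{0}: {1}", info.Culture.Name, info.Description);
        }
    }
}
```

Any culture name printed here should be safe to pass to new System.Globalization.CultureInfo(...) for the engine.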

We will definitely use the recognition from the documentation to test this first.

		Console.WriteLine("Recognized text: " + e.Result.Text); 

It should pick up more, but you will note that only one of these lines matches what I actually said.

		Recognized text: The
		Recognized text: that's what
		Recognized text: we've got
		Recognized text: .
		Recognized text: This does not seem to work
		Recognized text: on the
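Each result also carries a Confidence value, a relative score the recogniser assigns to the phrase. The sketch below prints it next to the text, which may help tell the guesses from the real matches. The 0.6 cut-off is an arbitrary assumption, not a recommended value, and the names ConfidenceDemo and OnSpeechRecognized are only for illustration.

```csharp
using System;
using System.Speech.Recognition;

class ConfidenceDemo
{
    // Assumed cut-off for flagging doubtful results; tune it by experiment.
    const float MinConfidence = 0.6f;

    static void Main()
    {
        using (var engine = new SpeechRecognitionEngine())
        {
            engine.LoadGrammar(new DictationGrammar());
            engine.SpeechRecognized += OnSpeechRecognized;
            engine.SetInputToDefaultAudioDevice();
            engine.RecognizeAsync(RecognizeMode.Multiple);

            Console.ReadLine(); // keep listening until Enter is pressed
        }
    }

    static void OnSpeechRecognized(object sender, SpeechRecognizedEventArgs e)
    {
        // Flag results that fall below the assumed cut-off instead of hiding them.
        string marker = e.Result.Confidence >= MinConfidence ? "" : " (low confidence)";
        Console.WriteLine("Recognized text: {0} [{1:F2}]{2}", e.Result.Text, e.Result.Confidence, marker);
    }
}
```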

Moving over to "en-US" didn't help at all, and neither did "en-UK" (the culture name for British English is actually "en-GB"). I figured it might be due to the small cover I have on my microphone, so I took it off. As it happens, speech recognition got worse. I even made the while loop exactly like the docs.microsoft page shows, and it still doesn't seem to work.

Looking further, I figured I should try the approach shared on a CodeProject page. It needed slight adjustments to use the same SpeechRecognitionEngine.

		using System.Speech.Recognition;
		using System.Speech.Synthesis;
		using System.Threading;

This seems like a minimal adjustment: it seems we need to specify the grammar ourselves. Here it is asked to look for "test".

        static void Main(string[] args)
        {
            SpeechRecognitionEngine _recognizer = new SpeechRecognitionEngine();

            Grammar gr = new Grammar(new GrammarBuilder("test"));
            gr.Name = "testGrammar";
            _recognizer.LoadGrammar(gr);

            _recognizer.SetInputToDefaultAudioDevice();
            _recognizer.SpeechRecognized += _recognizer_SpeechRecognized;
            _recognizer.RecognizeAsync(RecognizeMode.Multiple);

            while (true)
            {
                if (Console.ReadLine() == "qqq") return;
            }
        }

        private static void _recognizer_SpeechRecognized(object sender, SpeechRecognizedEventArgs e)
        {
            if (e.Result.Text == "test") // e.Result.Text contains the recognized text
            {
                Console.WriteLine("The test was successful!");
            }
        }

This is nice and dandy, but it doesn't completely work for us, as we don't want to rely on a single word. Testing a phrase is very simple; we only have minimal code to add.

        gr = new Grammar(new GrammarBuilder("I must remember"));
        gr.Name = "rememberGrammar";
        _recognizer.LoadGrammar(gr);

Also within the _recognizer_SpeechRecognized it is very simple:

        else
        {
            Console.WriteLine(e.Result.Text);
        }

I tested this. It works perfectly for recognising "I must remember", but recognition doesn't continue past that phrase, which is not how I want to use it. Through experimentation I found that we can't ask it to listen for another segment in that part of the grammar.
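One thing that might be worth trying here is GrammarBuilder.AppendDictation, which lets a fixed phrase be followed by free dictation inside a single grammar. This is only a sketch under that assumption; how well the dictated part is recognised will depend on the installed recogniser. The helper ExtractNote and the class name RememberDemo are hypothetical names for this example.

```csharp
using System;
using System.Speech.Recognition;

class RememberDemo
{
    const string Prefix = "I must remember";

    // Returns the dictated part after the fixed phrase, or null if the phrase is missing.
    public static string ExtractNote(string recognized)
    {
        if (recognized == null || !recognized.StartsWith(Prefix, StringComparison.OrdinalIgnoreCase))
            return null;
        return recognized.Substring(Prefix.Length).Trim();
    }

    static void Main()
    {
        using (var recognizer = new SpeechRecognitionEngine())
        {
            // Fixed phrase followed by free dictation in the same grammar.
            var builder = new GrammarBuilder(Prefix);
            builder.AppendDictation();
            recognizer.LoadGrammar(new Grammar(builder) { Name = "rememberGrammar" });

            recognizer.SpeechRecognized += (sender, e) =>
            {
                string note = ExtractNote(e.Result.Text);
                if (note != null) Console.WriteLine("Note: " + note);
            };
            recognizer.SetInputToDefaultAudioDevice();
            recognizer.RecognizeAsync(RecognizeMode.Multiple);

            Console.ReadLine(); // keep listening until Enter is pressed
        }
    }
}
```

With this shape there is no need to wait between the fixed phrase and the rest of the sentence, because both belong to one recognition.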

Experimentation continues, surely we should be able to make a second speech handler? I add in a second class, start making it all line up, and then see how it goes.

Well, I am confused. It apparently doesn't want to recognize my voice. It returns null, instead of what I say.

		Reading Line:
				CK
		Reading Line:
				a

Thinking about it, I needed to remove the 15 second limit; after doing that it works. Sure, it isn't perfect yet, but it only needed to be a step in the right direction again.

    class Program
    {
        private static SpeechRecognitionEngine _secondEngine;

        static void Main(string[] args)
        {
            _secondEngine = new SpeechRecognitionEngine(new System.Globalization.CultureInfo("en-US"));
            _secondEngine.LoadGrammar(new DictationGrammar());
            //_secondEngine.SpeechRecognized += ...;
            _secondEngine.SetInputToDefaultAudioDevice();

            SpeechRecognitionEngine _recognizer = new SpeechRecognitionEngine();

            Grammar gr = new Grammar(new GrammarBuilder("test"));
            gr.Name = "testGrammar";
            _recognizer.LoadGrammar(gr);

            gr = new Grammar(new GrammarBuilder("I must remember"));
            gr.Name = "rememberGrammar";
            _recognizer.LoadGrammar(gr);

            _recognizer.SetInputToDefaultAudioDevice();
            _recognizer.SpeechRecognized += _recognizer_SpeechRecognized;
            _recognizer.RecognizeAsync(RecognizeMode.Multiple);

            while (true)
            {
                if (Console.ReadLine() == "qqq") return;
            }
        }
        
        private static void _recognizer_SpeechRecognized(object sender, SpeechRecognizedEventArgs e)
        {
            if (e.Result.Text == "test") // e.Result.Text contains the recognized text
            {
                Console.WriteLine("The test was successful!");
            }
            else if (e.Result.Text == "I must remember")
            {
                Console.WriteLine("Reading Line: ");
                // Recognize() blocks until the dictation engine hears a phrase; it returns null when nothing is recognized
                var answer = _secondEngine.Recognize();
                Console.WriteLine("\t" + (answer != null ? answer.Text : "(nothing recognized)"));
            }
        }
    }

In conclusion, I have been trying a lot, and it doesn't seem to ever want to work. Through discussions and documentation, it is clear that "en-US" is currently the fastest and most reliable option. Perfectly fine, we should totally be using that. Unfortunately, this requires us to wait a second or two after saying "I must remember" before we say the rest, or it won't even start picking up anything we say. Then we need to rely on "en-US" to give us the rest of the sentence. The delay means we have to pause for a second every time, and then hope our sentence is heard. Currently it isn't.

		Reading Line:
				She
		Reading Line:
				GK

Both of those attempts were me saying "I must remember" … "to eat cake". I waited for "Reading Line:" to show, then said "to eat cake". As you can see, even allowing for the stumbling block of the delay, we just can't use it like this. Perhaps I should make a normal loop instead of having it call a static method? We would then have to process the messages differently. This is where I will leave voice recognition for now.
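A rough sketch of what such a normal loop could look like, using the synchronous Recognize call instead of the event handler. The ten-second timeout, the five attempts and the class name RecognizeLoop are arbitrary assumptions for illustration; nothing here suggests it recognises any better than the event-based version.

```csharp
using System;
using System.Globalization;
using System.Speech.Recognition;

class RecognizeLoop
{
    static void Main()
    {
        using (var engine = new SpeechRecognitionEngine(new CultureInfo("en-US")))
        {
            engine.LoadGrammar(new DictationGrammar());
            engine.SetInputToDefaultAudioDevice();

            for (int attempt = 1; attempt <= 5; attempt++)
            {
                Console.WriteLine("Reading Line ({0} of 5):", attempt);

                // Recognize blocks until a phrase is heard; it returns null
                // if no speech starts within the initial silence timeout.
                RecognitionResult result = engine.Recognize(TimeSpan.FromSeconds(10));

                Console.WriteLine("\t" + (result != null ? result.Text : "(nothing recognized)"));
            }
        }
    }
}
```

Because each call handles one phrase at a time, the messages can be processed in order inside the loop rather than in a separate static handler.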

Final Conclusion

The upshot is that I managed, quite simply and quite easily, to get text to speech working for the audio of the Factorio episodes. It turns out that speech recognition within .NET is not completely foolproof yet: sometimes it sees the exact sentence I say, and other times it gives a slightly confusing prediction.

As you can see in the Factorio episode, we have a robot helping to describe what we do, for a change. I hope it is enough.