Name

Nik Vaessen <nikvaes@gmail.com>

GSoC Project name

Improving WebRTC, Apache Camel and Jitsi

Mentors

Introduction

This is a summary of the work I have done over the past three months for Google Summer of Code (GSoC) 2016. A lot has changed from the initial goals I set in May. I had envisioned spending the first part of GSoC on a project in Apache Camel, and the second part on both a project in Jitsi and a project on tools for improving WebRTC. Together with my mentors it was quickly decided, because of time constraints, to narrow the second part down to working only on Jitsi. At the start of the second phase it was also decided to switch from Jitsi to Jitsi Meet.

SIP method MESSAGE in Camel

In the first month of GSoC I worked on implementing the SIP MESSAGE method in the camel-sip component of Camel. The feature request can be found in Camel’s issue tracker [1], and the pull request implementing it on Camel’s GitHub repository [2]. There were some problems with the Camel project. One of the mentors I was assigned stopped communicating from the community bonding phase onwards. My other mentors were familiar neither with the Camel code-base nor with the desired outcome of the work, and thus could not support me. That is why this work is not completed yet. Daniel Pocock, the issue reporter, took the effort to give feedback on my pull request, but only after I had already started to work on the second project, so I concentrated on finishing that instead. This means I still need to go through the feedback and fix the issues pointed out in the comments on the pull request. I will address them after GSoC has concluded.

  1. https://issues.apache.org/jira/browse/CAMEL-9190

  2. https://github.com/apache/camel/pull/1056

What was originally ICE for SIP in Jitsi

Initially my main project would have been implementing ICE for SIP calls in Jitsi [3]. This project was harder than expected because it required an understanding of multiple protocols. As I was still following university classes and had exams, I struggled to learn the required technologies, such as ICE and NAT traversal, in time. I did make two small commits while getting familiar with the Jitsi code-base, which can be found in [4] and [5]. My mentors and I decided to switch to a project with less complexity and fewer external dependencies.

  3. https://github.com/jitsi/jitsi/issues/233

  4. https://github.com/jitsi/jitsi/pull/277

  5. https://github.com/jitsi/jitsi/pull/279

Speech recognition and transcription in Jitsi Meet

During the rest of the summer I worked on a transcription module for Jitsi Meet, a free real-time multi-party video-conferencing web application. The project involved recording the audio of everyone in a conference and transcribing it. It was split into two components. The first was the client-side code running in a Jitsi Meet conference, which uses an external speech-to-text server to make transcriptions. A transcription is a piece of text, with names and timestamps, of what the people in the conference have said. Because we were unable to find a gratis external service, I wrote a custom speech-to-text server; this was the second component of the project.

Client side

The client side is JavaScript code [6]. It consists of several parts. One of these is the recorder, which uses the MediaRecorder API [7] to record everyone in the conference; a minimal sketch of this approach follows the references below. Another part handles the communication with the speech-to-text service. It is designed abstractly, so that it is fairly easy to plug in a new speech-to-text service. The rest combines these parts into a functioning transcription module.

  6. https://github.com/jitsi/lib-jitsi-meet/pull/215

  7. https://developer.mozilla.org/en/docs/Web/API/MediaRecorder_API
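
For illustration, here is a minimal sketch of how one participant’s audio can be recorded with the MediaRecorder API. It assumes a MediaStream is already available (the actual module in [6] obtains the streams through lib-jitsi-meet), and the function name is illustrative rather than the module’s real API.

    // Sketch: record one audio stream with the MediaRecorder API.
    // `stream` is assumed to be a MediaStream holding a participant's audio.
    function recordStream(stream, onDone) {
        var chunks = [];
        var recorder = new MediaRecorder(stream, { mimeType: 'audio/webm' });
        recorder.ondataavailable = function (event) {
            if (event.data.size > 0) {
                chunks.push(event.data); // collect recorded audio as it arrives
            }
        };
        recorder.onstop = function () {
            // Combine the chunks into a single WebM blob (as produced by Chrome).
            onDone(new Blob(chunks, { type: 'audio/webm' }));
        };
        recorder.start();
        return recorder; // call recorder.stop() to finish the recording
    }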

Server side

The speech-to-text server I created is a Java application which uses Jetty [8] as the HTTP server and Sphinx4 [9] as the speech-to-text library. The server can be found in a GitHub repository [10]. At the time of this writing, the entire repository is my work.

  8. http://www.eclipse.org/jetty/index.html

  9. http://cmusphinx.sourceforge.net/wiki/tutorialsphinx4

  10. https://github.com/jitsi/Sphinx4-HTTP-server

How does the module work?

The user tells the module to start transcribing; currently this is done in the browser console. The module starts by recording the audio. Every audio stream is stored in a separate file; when a Chrome browser is used, this will be a WebM file. When the module is told to stop, all files are sent to a speech-to-text service, in this case the Sphinx HTTP server.
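
As a rough sketch of the sending step, each recorded file can be posted to the service together with the participant’s name, so the transcript can attribute words to the right speaker. The URL and form-field names below are assumptions for illustration; the real request format is defined by the service implementation.

    // Sketch: post one recorded audio file to the speech-to-text service.
    function sendAudio(serviceURL, audioBlob, participantName) {
        var request = new XMLHttpRequest();
        request.open('POST', serviceURL);
        var form = new FormData();
        form.append('name', participantName); // attribute words to this speaker
        form.append('audio', audioBlob, participantName + '.webm');
        request.send(form);
        return request;
    }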

It would be more elegant to send the audio to the server in chunks while the conference is ongoing. Unfortunately, it is very hard to determine when to split up the audio stream and send a chunk. The main problem is detecting when someone has finished talking, and only splitting the audio then.

[Diagram displaying the process of transcribing]

Once the server has received all the files, they are converted to the WAV format, which the Sphinx library requires. Then the audio is transcribed. While the transcription is in progress, every newly transcribed word is immediately sent back to the client, using a chunked HTTP response. Note that transcribing an audio file takes roughly twice as long as the audio itself, and the accuracy of the transcribed words is around 50%; this varies with microphone quality, background noise and the accent of the speaker.
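
On the client, such a chunked response can be consumed incrementally with XMLHttpRequest: responseText grows as chunks arrive, so each progress event only needs to parse the new tail. The newline-separated wire format below is an assumption for illustration, not the actual format of the Sphinx4 HTTP server.

    // Sketch: read newly transcribed words from a chunked HTTP response.
    function listenForWords(request, onWord) {
        var seen = 0; // how much of responseText has been parsed already
        request.onprogress = function () {
            var text = request.responseText;
            // Only parse up to the last complete line; a word may still
            // be in transit.
            var end = text.lastIndexOf('\n');
            if (end > seen) {
                text.substring(seen, end).split('\n').forEach(function (word) {
                    if (word.length > 0) {
                        onWord(word);
                    }
                });
                seen = end + 1;
            }
        };
    }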

When the client has received all the responses from the server, it first formats them into the structure used by the merging algorithm. The algorithm takes, for every participant in the conference, an array of the words they uttered. Each array is sorted so that the first uttered word is the first element and the last uttered word is the last element. All arrays are then merged into a single transcription, which includes the names of the speakers and their sentences in chronological order; a sketch of this merge follows below.
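
The merge itself can be pictured as repeatedly taking the earliest remaining word across all participants. The sketch below assumes each word is an object with a speaker name and a begin time; these field names are illustrative, not the module’s actual structure.

    // Sketch: merge per-participant word arrays, each sorted by start time,
    // into one chronological transcript.
    function mergeTranscripts(wordArrays) {
        var result = [];
        var arrays = wordArrays.filter(function (a) { return a.length > 0; });
        while (arrays.length > 0) {
            // Find the array whose next word starts earliest.
            var earliest = 0;
            for (var i = 1; i < arrays.length; i++) {
                if (arrays[i][0].begin < arrays[earliest][0].begin) {
                    earliest = i;
                }
            }
            result.push(arrays[earliest].shift());
            arrays = arrays.filter(function (a) { return a.length > 0; });
        }
        return result; // e.g. [{ name: 'alice', begin: 1.2, word: 'hello' }, ...]
    }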

Room for improvement?

There are several features which could be added to the module. The transcription could include user events: for example, it could show that a new person entered the conference or that someone muted their audio. Also, the transcription process does not have a user interface; it has to be accessed via the browser console. Another open issue is what exactly to do with the finished transcription and how to deliver it to the user. We had several ideas, including sending it to the user by email, adding it to the description of the YouTube video of the conference, or uploading it to a GitHub Gist.

Configuring Jitsi Meet to use the module

To use this module, the URL of the speech-to-text service needs to be added to the config.js file of the Jitsi Meet instance. If the Sphinx server is used, the config should contain a variable named “sphinxURL”, as sketched below.
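
The excerpt below shows what that could look like; the URL is only a placeholder for wherever the Sphinx4 HTTP server is actually running.

    // Excerpt of a Jitsi Meet config.js (placeholder URL).
    var config = {
        // ... other Jitsi Meet options ...
        sphinxURL: 'https://transcription.example.com/recognize'
    };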

Implementing your own transcription service

If the accuracy of the Sphinx server is not adequate, it is possible to extend the module to use another, more accurate (paid) service. To do so, create a new JavaScript class which extends the prototype of the TranscriptionService found in AbstractTranscriptionService.js [11]. The methods “sendRequest”, “formatResponse” and “verify” must be overridden. The “sendRequest” method holds the logic needed to send the audio files to the new service. The “formatResponse” method formats the response of the server into the structure required by the merging algorithm. The “verify” method checks whether a retrieved response is valid, e.g. not a 404 “Not Found” response. A sketch follows the reference below.

  11. https://github.com/nikvaessen/lib-jitsi-meet/blob/transcription_test/modules/transcription/transcriptionServices/AbstractTranscriptionService.js
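
A minimal sketch of such an extension, following the prototype pattern of the module; the require path, constructor name and method bodies here are illustrative assumptions, not the module’s exact layout.

    // Sketch: a custom transcription service extending the abstract prototype.
    var TranscriptionService = require('./AbstractTranscriptionService');

    var MyTranscriptionService = function () {
        TranscriptionService.call(this);
    };
    MyTranscriptionService.prototype =
        Object.create(TranscriptionService.prototype);

    // Send the recorded audio files to the external service.
    MyTranscriptionService.prototype.sendRequest = function (audioBlob, callback) {
        // POST audioBlob to the service and pass the reply to callback.
    };

    // Turn the service's reply into the structure the merging algorithm expects.
    MyTranscriptionService.prototype.formatResponse = function (response) {
        // return an array of word objects
    };

    // Reject malformed or error responses, e.g. a 404 "Not Found".
    MyTranscriptionService.prototype.verify = function (response) {
        return response !== null;
    };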

Mailing list reports during GSoC

This summary was written on 2016-08-22