Text to Speech Web Application - Elinext Case Study

Client

Elinext is an international company that delivers custom software development services. We make software for our customers, and sometimes our products are handy for internal use. This is one of over 20+ software products that is now being used within the company. Like many other solutions we develop for our ecosystem, this one simplifies the everyday life of employees.

This software code will be used to speed up the development of a related custom product (app for transcribing speech to text, or even text to speech web application) for our clients.

Project Description

The project is a web application that transcribes the speech within team meetings into short text descriptions and sends this report to the email of the participants.

Through a simple and user-friendly interface, one can easily:

upload the record of a meeting in convenient formats (i.e. m4a, mov, avi, mp3)
convert video to audio. With the help of AWS Transcribe, audio is later converted to text. Then based on this text, such tools as AWS Bedrock make short text summaries. The system sends these reports to your email.

For better text recognition and precision, it is possible to specify the number of participants at the meeting.Essentially, this was a speech to text app with timely e-mailing of the result. This app for transcribing speech to text was to be used internally, for Elinext employees’ convenience. Later some parts of it might be offered to our clients as part of web application development services we constantly deliver.

Challenges

The company needed a tool for accurate meeting summaries for better communication and record tracking. Our main challenge was to get a voice transcriber app to avoid inefficiency and inaccuracy in converting the meeting transcription process to text.

The solution should handle various audio/video formats and provide concise, precise, and accurate text summaries.

Primary challenges included:

investigation of existing tools for voice transcription (voice to text transcriber)
investigation of existing tools for creating short text descriptions based on various language models
incorporating the selected tools for the project execution

Process

The development process of this text to speech web app could be divided into 5 stages, each of which was successfully completed within a short timeframe.

Planning Stage (2–3 Days)

Objective: Define the scope, features, and technical requirements of the entire project.

Activities: Our engineers gathered the requirements, and identified the core functionalities, such as file upload, transcription, summarization, and email notifications.

Technology Selection: Our team chose Python for backend development, FastAPI for the framework, AWS services for transcription and summarization, and ElasticMails for email delivery.

Deliverables:

Finalized technology stack.

Design Stage (2–3 Days)

Objective: Design the system architecture and user interface.

Activities: System Architecture Design:

Our engineers defined the flow of data, from file upload to email delivery.
Our engineers outlined interactions between components (FastAPI, AWS services, and ElasticMails).
They designed a scalable backend to handle file uploads and processing.

UI/UX Design:

Our team created wireframes for the user interface, focusing on simplicity and ease of use.
They designed forms for file uploads and fields for specifying participants.

Database and Storage Planning:

Our developer planned the use of Amazon S3 only for internal communication between AWS services such as AWS Transcribe (most fitting voice transcriber app) and AWS Bedrock and is not used for permanent storage of internal files.

Deliverables:

Architecture diagrams.
UI wireframes and design mockups.

Development Stage (1–2 Weeks)

Objective: Build the application and integrate all functionalities.

Activities: Backend Development:

Implemented file upload functionality using FastAPI.
Integrated AWS SDK (boto3-client) to interact with Transcribe and Bedrock services.
Developed the workflow for video-to-audio conversion using FFmpeg.
Set up APIs to manage transcription, summarization, and email notifications.
Adding authentication mechanism with Amazon Cognito.

Frontend Development:

Built a simple, responsive interface using HTML and CSS.
Added interactivity (e.g., drag-and-drop file upload) using JavaScript.

Email Integration:

Configured ElasticMails API for sending meeting summaries.

Cloud Setup:

Configured AWS services (Transcribe, Bedrock, S3) for the application.

Deliverables:

Fully functional backend and frontend.
Configured cloud services.
Integrated email notification system.

Testing Stage (2–3 Days)

Objective: Ensure the application works as expected and meets quality standards.

Activities: Unit Testing: tested individual components like file uploads, transcription, summarization, and email delivery.

Integration Testing:

Verified that all components work together seamlessly.
Tested workflows for various input formats (m4a, mp3, mov, avi).

User Testing:

Simulated end-user interactions to validate the UI/UX and overall functionality.

Deliverables:

A bug-free, optimized application ready for deployment.

Deployment and Implementation Stage (2–3 Days)

Objective: Deploy the application and make it available for use.

Activities:

Deployed the backend application using Uvicorn on a cloud-hosted environment (GCP platform).
Configured Amazon S3 for file storage and linked it to the application.
Set up DNS and hosting for the front end.
Monitored the system post-deployment to ensure smooth operation.

Deliverables:

Live application accessible to users.
Finalized documentation for maintenance and future updates.

The project manager and the developer had meetings every two days for a couple of weeks while discussing key architectural decisions of the application prototype. Those decisions included the choice of mail provider, speech recognition technology, and current progress of the development.

Solution

The result of the project is the solution provided by Elinext Meeting Minutes, a text to speech web app. It is a web application designed to transcribe meetings into concise text summaries and deliver them to users via email.

Key Functionalities of the solution include upload feature, video-to-audio conversion tool, speech-to-text conversion tool, text summarization instrument, participant specification feature, and a nice user interface. Let’s analyze these features one-by-one.

Upload Feature

The application allows users to upload meeting recordings in widely used formats (m4a, mov, avi, and mp3). This ensures compatibility with various most popular and wide-spread recording devices and tools.

Video to Audio Conversion

If the uploaded file is a video, the system extracts the audio portion of the file automatically. This step ensures that video files are seamlessly handled without requiring any manual pre-processing.

Privacy and Security

For security reasons, at any step, intermediate data such as converted audio and text are not stored anywhere and are used only within the pipeline of creating a final report with the key points of the meeting.

Speech-to-Text Conversion

The extracted (or uploaded) audio is processed using AWS Transcribe, a robust cloud-based service that is essentially a voice to text transcriber. AWS Transcribe converts spoken words in the audio into text with high accuracy. This conversion supports multiple speakers and accommodates variations in accents or audio quality, enabling precise transcription. Our task wasn’t in finding text to speech web application, so AWS Transcribe would perfectly do all the job.

Text Summarization

The transcribed text is further summarized into a short meaningful description using AWS Bedrock, which specializes in natural language processing tasks. This summary captures the key points of the meeting, making the output more concise and useful for end-users.

Email Delivery

Once the summarization is complete, the system automatically sends the report to the user’s registered email. This ensures that users participating in the meeting receive their summarized meeting review promptly and conveniently.

Participant Specification

For enhanced text recognition and speaker attribution, the user can specify the number of participants in the meeting. This would help AWS Transcribe to differentiate and attribute speech to individual speakers, resulting in more accurate transcriptions.

User Interface

The application is designed with a simple and user-friendly interface to ensure ease of use. Users can upload files, set preferences (e.g., participant count), and receive outputs without requiring technical expertise.

Authentication Flows

Whether it is using managed login or building a custom front end with an AWS SDK for authentication, it's essential to configure the app to support the desired authentication methods.

After the solution was developed, local testing was enabled, including testing with more than hour-long rallies. It’s not a text to speech web application, so these kinds of tests were omitted.

Results

Currently, this project is in the prototype stage. The next steps for the app for transcribing speech to text are:

Improving UX features that are worth adding for ease of use.
Testing with some other video formats and lengths.

The main learning point for our developer in the process was integrating AWS services (Transcribe and Bedrock) and optimizing the performance and scalability of the solution.

As a result of the Elinext Meeting Minutes solution, our teams started to experience several immediate benefits, including time-saving and working efficiency(concise meeting summaries via email shortly after these meetings are very helpful and the absence in need of manual transcription is very effort-saving).

We often deliver top-notch Python web development solutions, so it wouldn’t come as a surprise if teh experienced gained during this project will be relevant in delivering, let’s say a development of text to speech web application anytime soon.