Hassan El Mghari

Posted on • Originally published at stack.convex.dev

How I built NotesGPT – a full-stack AI voice note app

Last week, I launched notesGPT, a free and open source voice note app that has had 35,000 visitors, 7,000 users, and over 1,000 GitHub stars in its first week. It lets you record a voice note, transcribes it using Whisper, and uses Mixtral via Together to extract action items and display them in an action items view. It comes equipped with authentication, storage, vector search, and action items, and is fully responsive on mobile for ease of use.

I’m going to walk you through exactly how I built it.

Architecture and tech stack

This is a quick diagram for the architecture. We’ll be discussing each piece in more depth and also showing code examples as we go.

Architecture Diagram

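Here’s the overall tech stack I used: Next.js and Tailwind CSS for the frontend, Clerk for authentication, Convex for the backend (cloud functions, database, file storage, and vector search), Whisper hosted on Replicate for transcription, and Mixtral plus embeddings via Together.ai.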

Landing Page

The first piece of the app is the landing page you see when you navigate to notesGPT.

Landing page of notesGPT

The first thing users see is this landing page, which, along with the rest of the app, was built with Next.js and styled with Tailwind CSS. I enjoy using Next.js since it makes it easy to spin up web apps and just write React code. Tailwind CSS is great too since it lets you iterate quickly on your web pages while staying in the same file as your JSX.

Authentication with Clerk and Convex

When the user clicks either of the buttons on the homepage, they’re directed to the sign-in screen. This is powered by Clerk, an easy authentication solution that integrates well with Convex, which is what we’ll be using for our entire backend, including cloud functions, database, storage, and vector search.

Auth page

Clerk and Convex are both easy to set up. You can simply create an account on both services, install their npm libraries, run npx convex dev to set up your convex folder, and create a ConvexProvider.tsx file (it needs the .tsx extension because it contains JSX) as seen below to wrap your app with.

'use client';

import { ReactNode } from 'react';
import { ConvexReactClient } from 'convex/react';
import { ConvexProviderWithClerk } from 'convex/react-clerk';
import { ClerkProvider, useAuth } from '@clerk/nextjs';

const convex = new ConvexReactClient(process.env.NEXT_PUBLIC_CONVEX_URL!);

export default function ConvexClientProvider({
  children,
}: {
  children: ReactNode;
}) {
  return (
    <ClerkProvider
      publishableKey={process.env.NEXT_PUBLIC_CLERK_PUBLISHABLE_KEY!}
    >
      <ConvexProviderWithClerk client={convex} useAuth={useAuth}>
        {children}
      </ConvexProviderWithClerk>
    </ClerkProvider>
  );
}

Check out the Convex Quickstart and the Convex Clerk auth section for more details.

Setting up our schema

You can use Convex with or without a schema. In my case, I knew the structure of my data and wanted to define it, so I did so below. This also gives you a really nice type-safe API to work with when interacting with your database. We’re defining two tables – a notes table to store all voice note information and an actionItems table for extracted action items. We’ll also define indexes so we can quickly query the data by userId and noteId, plus a vector index on the notes’ embeddings for search.

import { defineSchema, defineTable } from 'convex/server';
import { v } from 'convex/values';

export default defineSchema({
  notes: defineTable({
    userId: v.string(),
    audioFileId: v.string(),
    audioFileUrl: v.string(),
    title: v.optional(v.string()),
    transcription: v.optional(v.string()),
    summary: v.optional(v.string()),
    embedding: v.optional(v.array(v.float64())),
    generatingTranscript: v.boolean(),
    generatingTitle: v.boolean(),
    generatingActionItems: v.boolean(),
  })
    .index('by_userId', ['userId'])
    .vectorIndex('by_embedding', {
      vectorField: 'embedding',
      dimensions: 768,
      filterFields: ['userId'],
    }),
  actionItems: defineTable({
    noteId: v.id('notes'),
    userId: v.string(),
    task: v.string(),
  })
    .index('by_noteId', ['noteId'])
    .index('by_userId', ['userId']),
});
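
For reference, the notes table above corresponds roughly to the TypeScript shape below. This is a hand-written sketch, not Convex's generated types (in a real project the generated Doc<'notes'> type would be used instead). The NoteDoc name and the isProcessing helper are hypothetical and only illustrate how the three generating* flags could drive a loading state in the UI.

```typescript
// Hand-written approximation of a document in the `notes` table.
// Assumption: field names mirror the schema; this is not generated code.
interface NoteDoc {
  userId: string;
  audioFileId: string;
  audioFileUrl: string;
  title?: string;
  transcription?: string;
  summary?: string;
  embedding?: number[];
  generatingTranscript: boolean;
  generatingTitle: boolean;
  generatingActionItems: boolean;
}

// Hypothetical helper: a note is still processing while any flag is set.
function isProcessing(note: NoteDoc): boolean {
  return (
    note.generatingTranscript ||
    note.generatingTitle ||
    note.generatingActionItems
  );
}

// A freshly created note, before any AI step has finished.
const freshNote: NoteDoc = {
  userId: 'user_123',
  audioFileId: 'storage_abc',
  audioFileUrl: 'https://example.com/audio.webm',
  generatingTranscript: true,
  generatingTitle: true,
  generatingActionItems: true,
};

// The same note once the transcript, title, and action items are saved.
const finishedNote: NoteDoc = {
  ...freshNote,
  title: 'Grocery run',
  transcription: 'Buy milk and eggs.',
  summary: 'I need to buy milk and eggs.',
  generatingTranscript: false,
  generatingTitle: false,
  generatingActionItems: false,
};
```

The optional fields start out missing and are filled in as each background step completes, which is why they are wrapped in v.optional in the schema.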

Dashboard

Now that we have our backend and authentication set up along with our schema, we can take a look at fetching data. After signing into the app, users can view their dashboard, which lists all of the voice notes they’ve recorded.

Dashboard

To do this, we first define a query in the convex folder that uses auth to read the userId, verifies that it’s valid, and returns all the notes that match that userId, each with a count of its action items.

export const getNotes = queryWithUser({
  args: {},
  handler: async (ctx, args) => {
    const userId = ctx.userId;
    if (userId === undefined) {
      return null;
    }
    const notes = await ctx.db
      .query('notes')
      .withIndex('by_userId', (q) => q.eq('userId', userId))
      .collect();

    const results = await Promise.all(
      notes.map(async (note) => {
        const count = (
          await ctx.db
            .query('actionItems')
            .withIndex('by_noteId', (q) => q.eq('noteId', note._id))
            .collect()
        ).length;
        return {
          count,
          ...note,
        };
      }),
    );

    return results;
  },
});
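
To make the shape of the result concrete, here is a plain in-memory sketch of what the query computes: the signed-in user's notes, each with a count of its action items. It uses arrays instead of Convex indexes, and the notesWithCounts name, the simplified types, and the sample data are all hypothetical.

```typescript
// Simplified stand-ins for rows in the two tables.
interface NoteRow {
  _id: string;
  userId: string;
  title?: string;
}

interface ActionItemRow {
  noteId: string;
  userId: string;
  task: string;
}

// Hypothetical in-memory equivalent of the query's logic:
// keep only the user's notes and attach a `count` of action items to each.
function notesWithCounts(
  userId: string,
  notes: NoteRow[],
  actionItems: ActionItemRow[],
): Array<NoteRow & { count: number }> {
  return notes
    .filter((note) => note.userId === userId)
    .map((note) => ({
      count: actionItems.filter((item) => item.noteId === note._id).length,
      ...note,
    }));
}

// Made-up sample data for illustration only.
const sampleNotes: NoteRow[] = [
  { _id: 'n1', userId: 'user_1', title: 'Standup' },
  { _id: 'n2', userId: 'user_1' },
  { _id: 'n3', userId: 'user_2', title: 'Someone else' },
];

const sampleItems: ActionItemRow[] = [
  { noteId: 'n1', userId: 'user_1', task: 'Email the team' },
  { noteId: 'n1', userId: 'user_1', task: 'Book a room' },
  { noteId: 'n3', userId: 'user_2', task: 'Unrelated task' },
];

const dashboard = notesWithCounts('user_1', sampleNotes, sampleItems);
```

In the real query, the two filter steps are handled by the by_userId and by_noteId indexes, so the database does not need to scan every row.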

After this, we can call the getNotes query with the user’s authentication token, using the preloadQuery function that Convex provides, to display all of the user’s notes in the dashboard. We’re using server-side rendering to fetch this data on the server and then pass it into the <DashboardHomePage /> client component. This approach also keeps the data up to date on the client.

import { api } from '@/convex/_generated/api';
import { preloadQuery } from 'convex/nextjs';
import DashboardHomePage from './dashboard';
import { getAuthToken } from '../auth';

const ServerDashboardHomePage = async () => {
  const token = await getAuthToken();
  const preloadedNotes = await preloadQuery(api.notes.getNotes, {}, { token });

  return <DashboardHomePage preloadedNotes={preloadedNotes} />;
};

export default ServerDashboardHomePage;

Recording a voice note

Initially, users won’t have any voice notes on their dashboard so they can click the “record a new voice note” button to record one. They’ll see the following screen that will allow them to record.

Record a voice note page

This records a voice note using native browser APIs, saves the file in Convex file storage, then sends it to Whisper through Replicate to be transcribed. The first thing we do is define a createNote mutation in our convex folder that takes in the stored recording, saves some information in the Convex database, then schedules the Whisper action.

export const createNote = mutationWithUser({
  args: {
    storageId: v.id('_storage'),
  },
  handler: async (ctx, { storageId }) => {
    const userId = ctx.userId;
    const fileUrl = (await ctx.storage.getUrl(storageId)) as string;

    const noteId = await ctx.db.insert('notes', {
      userId,
      audioFileId: storageId,
      audioFileUrl: fileUrl,
      generatingTranscript: true,
      generatingTitle: true,
      generatingActionItems: true,
    });

    await ctx.scheduler.runAfter(0, internal.whisper.chat, {
      fileUrl,
      id: noteId,
    });

    return noteId;
  },
});
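
The client side of this flow is not shown above. Convex file uploads generally work in three steps: ask the backend for a short-lived upload URL, POST the file to it, then pass the returned storageId to a mutation. The sketch below models that order with injected functions so it runs without a browser or a backend. The saveRecording, UploadDeps, AudioFile, and postFile names are hypothetical, and the fake dependencies exist only for illustration.

```typescript
// Stand-in for the recorded audio; in the browser this would be a Blob.
interface AudioFile {
  type: string;
  sizeInBytes: number;
}

// The three steps are injected so the sketch has no external dependencies.
interface UploadDeps {
  generateUploadUrl: () => Promise<string>;
  postFile: (url: string, file: AudioFile) => Promise<{ storageId: string }>;
  createNote: (args: { storageId: string }) => Promise<string>;
}

// Hypothetical helper: upload the recording, then create the note for it.
async function saveRecording(
  file: AudioFile,
  deps: UploadDeps,
): Promise<string> {
  const uploadUrl = await deps.generateUploadUrl();
  const { storageId } = await deps.postFile(uploadUrl, file);
  return deps.createNote({ storageId });
}

// Fake dependencies that record the order in which they are called.
const calls: string[] = [];
const fakeDeps: UploadDeps = {
  generateUploadUrl: async () => {
    calls.push('generateUploadUrl');
    return 'https://upload.example/abc';
  },
  postFile: async (url, file) => {
    calls.push(`postFile:${url}:${file.type}`);
    return { storageId: 'storage_1' };
  },
  createNote: async ({ storageId }) => {
    calls.push(`createNote:${storageId}`);
    return 'note_1';
  },
};
```

In the app, createNote would be the mutation shown above, which returns the new note's id as soon as the Whisper action has been scheduled, without waiting for the transcription itself.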

The whisper action is seen below. It uses Replicate as the hosting provider for Whisper.

export const chat = internalAction({
  args: {
    fileUrl: v.string(),
    id: v.id('notes'),
  },
  handler: async (ctx, args) => {
    const replicateOutput = (await replicate.run(
      'openai/whisper:4d50797290df275329f202e48c76360b3f22b08d28c196cbc54600319435f8d2',
      {
        input: {
          audio: args.fileUrl,
          model: 'large-v3',
          translate: false,
          temperature: 0,
          transcription: 'plain text',
          suppress_tokens: '-1',
          logprob_threshold: -1,
          no_speech_threshold: 0.6,
          condition_on_previous_text: true,
          compression_ratio_threshold: 2.4,
          temperature_increment_on_fallback: 0.2,
        },
      },
    )) as whisperOutput;

    const transcript = replicateOutput.transcription || 'error';

    await ctx.runMutation(internal.whisper.saveTranscript, {
      id: args.id,
      transcript,
    });
  },
});

All of the uploaded audio files can also be seen in the Convex dashboard under “Files”.

Convex dashboard

Generating action items

After the user finishes recording their voice note and it gets transcribed via Whisper, the transcript is passed to Together AI. We show this loading screen in the meantime.

Page loading

We first define a schema that we want our output to be in. We then pass this schema into our Mixtral model hosted on Together.ai with a prompt to generate a title, a summary of the voice note, and action items based on the transcript. We then save all this information to the Convex database. To do this, we create a Convex action in the convex folder.

// convex/together.ts

const NoteSchema = z.object({
  title: z
    .string()
    .describe('Short descriptive title of what the voice message is about'),
  summary: z
    .string()
    .describe(
      'A short summary in the first person point of view of the person recording the voice message',
    )
    .max(500),
  actionItems: z
    .array(z.string())
    .describe(
      'A list of action items from the voice note, short and to the point. Make sure all action item lists are fully resolved if they are nested',
    ),
});

export const chat = internalAction({
  args: {
    id: v.id('notes'),
    transcript: v.string(),
  },
  handler: async (ctx, args) => {
    const { transcript } = args;

    // `client` is created elsewhere in this file.
    const extract = await client.chat.completions.create({
      messages: [
        {
          role: 'system',
          content:
            'The following is a transcript of a voice message. Extract a title, summary, and action items from it and answer in JSON in this format: {title: string, summary: string, actionItems: [string, string, ...]}',
        },
        { role: 'user', content: transcript },
      ],
      model: 'mistralai/Mixtral-8x7B-Instruct-v0.1',
      response_model: { schema: NoteSchema, name: 'SummarizeNotes' },
      max_tokens: 1000,
      temperature: 0.6,
      max_retries: 3,
    });
    const { title, summary, actionItems } = extract;

    // Save the extracted fields on the note.
    await ctx.runMutation(internal.together.saveSummary, {
      id: args.id,
      summary,
      actionItems,
      title,
    });
  },
});
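
The schema matters because the model's reply is free-form text that has to be checked before it is saved. As a rough, dependency-free illustration of the kind of check the Zod schema expresses, the sketch below parses a JSON reply and validates its shape by hand. It is a stand-in, not how Zod or the client above works internally, and the parseNoteJson name and the sample reply are hypothetical.

```typescript
// The shape we expect back, mirroring the fields of NoteSchema.
interface NoteExtract {
  title: string;
  summary: string;
  actionItems: string[];
}

// Hypothetical hand-rolled validator for a JSON reply from the model.
// Throws if the reply is not valid JSON or does not match the shape.
function parseNoteJson(raw: string): NoteExtract {
  const data: unknown = JSON.parse(raw);
  if (typeof data !== 'object' || data === null) {
    throw new Error('Expected a JSON object');
  }
  const { title, summary, actionItems } = data as Record<string, unknown>;
  if (typeof title !== 'string') {
    throw new Error('title must be a string');
  }
  // The 500-character limit matches .max(500) in the schema.
  if (typeof summary !== 'string' || summary.length > 500) {
    throw new Error('summary must be a string of at most 500 characters');
  }
  if (
    !Array.isArray(actionItems) ||
    !actionItems.every((item) => typeof item === 'string')
  ) {
    throw new Error('actionItems must be an array of strings');
  }
  return { title, summary, actionItems: actionItems as string[] };
}

// Made-up reply for illustration only.
const sampleReply =
  '{"title":"Grocery run","summary":"I need to buy milk.","actionItems":["Buy milk"]}';
const parsed = parseNoteJson(sampleReply);
```

A reply that fails a check like this is a natural point to retry the request, which is presumably what the max_retries setting in the action above is for.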

When Together.ai responds, we get this final screen which lets users toggle between their transcript and a summary on the left, and see and check off the action items on the right.

Full voice note page

Vector Search

The final piece of the app is vector search. We’re using Together.ai embeddings to embed the transcripts and make it possible for folks to search in the dashboard based on the semantic meaning of the transcripts.

We do this by creating a similarNotes action in the convex folder that takes in a user’s search query, generates an embedding for it, and finds the most similar notes to display on the page.

export const similarNotes = actionWithUser({
  args: {
    searchQuery: v.string(),
  },
  handler: async (ctx, args): Promise<SearchResult[]> => {
    // 1. Create the embedding (newlines are replaced with spaces first)
    const getEmbedding = await togetherai.embeddings.create({
      input: [args.searchQuery.replace(/\n/g, ' ')],
      model: 'togethercomputer/m2-bert-80M-32k-retrieval',
    });
    const embedding = getEmbedding.data[0].embedding;

    // 2. Then search for similar notes
    const results = await ctx.vectorSearch('notes', 'by_embedding', {
      vector: embedding,
      limit: 16,
      filter: (q) => q.eq('userId', ctx.userId), // Only search my notes.
    });

    return results.map((r) => ({
      id: r._id,
      score: r._score,
    }));
  },
});
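
To show what "most similar" means here, the sketch below ranks notes by cosine similarity between a query embedding and each note's embedding, restricted to one user. It is a brute-force illustration with toy 3-dimensional vectors, on the assumption that the score behaves like a cosine similarity. The real index uses 768 dimensions and an approximate search, and the cosineSimilarity and rankNotes names and the sample data are hypothetical.

```typescript
// Cosine similarity: 1 means same direction, 0 means unrelated.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error('Vectors must have the same length');
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) {
    return 0; // Avoid dividing by zero for an all-zero vector.
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

interface EmbeddedNote {
  id: string;
  userId: string;
  embedding: number[];
}

// Hypothetical brute-force version of the search: filter by user,
// score every note, sort by score (highest first), keep the top `limit`.
function rankNotes(
  query: number[],
  notes: EmbeddedNote[],
  userId: string,
  limit: number,
): Array<{ id: string; score: number }> {
  return notes
    .filter((note) => note.userId === userId)
    .map((note) => ({
      id: note.id,
      score: cosineSimilarity(query, note.embedding),
    }))
    .sort((x, y) => y.score - x.score)
    .slice(0, limit);
}

// Toy data for illustration only.
const embeddedNotes: EmbeddedNote[] = [
  { id: 'a', userId: 'user_1', embedding: [0.9, 0.1, 0] },
  { id: 'b', userId: 'user_1', embedding: [0, 1, 0] },
  { id: 'c', userId: 'user_2', embedding: [1, 0, 0] },
];

const ranked = rankNotes([1, 0, 0], embeddedNotes, 'user_1', 2);
```

Note c is the closest match to the query but belongs to another user, so it is excluded, which is the same role the userId filter plays in the action above.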

Conclusion

Just like that, we’ve built a production-ready full-stack AI app with authentication, a database, storage, and APIs. Feel free to check out notesGPT to generate action items from your notes, or the GitHub repo for reference. And if you have any questions, shoot me a DM and I’d be more than happy to answer them!

Top comments (18)

writtinfool

Tested it on my Android phone and I'm very impressed. There is a latency issue where no text is processed until you stop recording, and then there is a several second delay until the transcription appears. I'm sure there are ways to do segment processing to overcome this. It is a great 1.0 version. I think when it gets to 3.1.1 it will be a winner.

Hassan El Mghari

Thank you! Yes I agree, definitely still a lot to work on to get it just right

Hassan El Mghari

Thanks! I keep a Notion doc of ideas anytime I see something cool on the internet or think of something and just keep refining it

Mr. Linxed

Really interesting concept. I like the idea of just rambling to your phone and it'll automatically figure out your daily schedule/todo items 😊

Ranjan Dailata • Edited

Heard about Conversational Intelligence or Conversational AI?

Camila Freitas

Congrats! I tested NotesGPT in Portuguese and the transcription was very good. The summary and action items are in English, which I don't see as a problem because they make sense. A very cool project

Flowzai

I'm new in the web development field. Thank you for your valuable content.

miraculixx • Edited

That's awesome! Congrats! Where/how did you launch & promote to get these numbers in just 1 week? Seriously curios, want to learn. Any pointers would be most appreciated. Thanks!

Programming with Shahan

Awesome project. I love it.

Daniella Elsie E.

This is lovely

June Peng

great! thanks for sharing.

Vitalina Hlukhenka

are there any limitations on the recording time? I created a recording ~30 min and didn't get any summary or action items. Did I do something wrong or the recording was too long?

Hassan El Mghari

Sorry to hear that! It should be able to handle up to 30-45 minutes since that's how much context the LLM model I'm using can handle to summarize it so you may have just hit the limit. I can add support for longer voice notes later on too!

Vitalina Hlukhenka

Thanks! Would be amazing
