Developing public sample data for building AI assistance on Mattermost interactions

Just an FYI to our community: we’re working on open sourcing a “corpus” of Mattermost sample data that is designed to be public and used for developing AI scenarios. In other words, people can download a database with conversations in public, private, and DM channels and pass that context to LLMs, so the models return replies tailored to different users based on each user’s data in the system.
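To make the idea concrete, here is a rough sketch of how context might be assembled for one user before calling an LLM. The post and membership shapes (`channel_id`, `user_id`, `message`, `create_at`) and the function names are assumptions for illustration, not the actual corpus schema, and no LLM call is made here.

```python
# Hypothetical sketch: field names and helpers are assumptions, not the corpus schema.

def visible_posts(posts, memberships, user_id):
    """Keep only posts from channels the given user is a member of."""
    allowed = memberships.get(user_id, set())
    return [p for p in posts if p["channel_id"] in allowed]


def build_prompt(posts, memberships, user_id, question, limit=20):
    """Build a plain-text prompt from the user's most recent visible posts."""
    ordered = sorted(visible_posts(posts, memberships, user_id),
                     key=lambda p: p["create_at"])
    recent = ordered[-limit:]  # newest `limit` posts, oldest first
    context = "\n".join(
        f'[{p["channel_id"]}] {p["user_id"]}: {p["message"]}' for p in recent
    )
    return f"Context for {user_id}:\n{context}\n\nQuestion: {question}"
```

The point of filtering by membership first is that two users asking the same question would get different context, and a user never sees content from a channel they are not in.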

The initial system will include:

1 - Sample of non-confidential conversations - These take place on a Mattermost server created by Mattermost staff for the purpose of developing AI-accelerated workflows. They will include interactions with AI bots in the system, and the development of interactions and features with those bots. No confidential information is to be shared in these interactions.

2 - Non-confidential meeting recordings, transcriptions and summaries - At the outset we’ll be using the self-hosted “Calls” feature in Mattermost to record meetings, and feed the content into LLM backends (initially OpenAI with Whisper and ChatGPT) to develop different meeting summarization features. The ultimate goal is to use the context in the sample data to have the LLM backends personalize the summaries. For example, technical people working on projects A, B, and C would get a different set of bullet points from a meeting than non-technical people working on X, Y, and Z, based on their interactions in the Mattermost data.

3 - Scrubbed user data - A live Mattermost system includes names and email addresses, and potentially some other PII. We’ll get consent to use PII from all users on the system, or maybe offer an option to scrub or change names, as we also scrub out email addresses. Beyond conversations, what’s important about user data is things like channel and team memberships, emoji reactions, and other data that could inform the personalization of AI results.
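As a rough illustration of the scrubbing in item 3, here is a minimal sketch that replaces names and email addresses with stable pseudonyms while keeping team and channel memberships. The record layout, the salt, and the function names are all assumptions made for this example; this is not the actual scrubbing process used for the corpus.

```python
import hashlib
import re

# Simple email pattern for illustration; real-world addresses can be more varied.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def pseudonym(value: str, salt: str = "corpus-demo") -> str:
    """Map a value to a stable alias; the salt here is a placeholder assumption."""
    digest = hashlib.sha256((salt + value.lower()).encode("utf-8")).hexdigest()
    return f"user_{digest[:8]}"


def scrub_user(user: dict) -> dict:
    """Replace name and email, keep the fields useful for personalization."""
    alias = pseudonym(user["email"])
    return {
        "id": user["id"],
        "username": alias,
        "email": f"{alias}@example.com",
        "teams": list(user.get("teams", [])),
        "channels": list(user.get("channels", [])),
    }


def scrub_text(text: str) -> str:
    """Replace any email address found in message text with its alias."""
    return EMAIL_RE.sub(lambda m: pseudonym(m.group(0)) + "@example.com", text)
```

Because the alias is derived from the email address, the same person maps to the same pseudonym in both user records and message text, so memberships and mentions still line up after scrubbing.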

The first release of the sample data will be part of the “Mattermost OpenOps” framework, a platform for developing AI-accelerated operational workflows while controlling 100% of private data and maintaining portability among different LLM backends.

We’ll initially be running the system on sample data from Mattermost staff only. Over time, however, we may invite select community members, partners, and maybe even willing customers who agree to contribute to a public sample data repository.

Another advantage of being on the sample data generation server is that it’ll have the latest previews of what we’re building with different AI models and our progress towards personalized workplace AI.

More to come. To learn more, follow us on this forum.


If you have an @mattermost.com email address, you can sign up for an account at https://corpus.mattermost.io/
