This week Microsoft added an automated transcription feature to their online Word. It is easy to use and renders a surprisingly accurate transcript that differentiates between speakers and keeps the times of the conversation segments. This is great from a user perspective. Why bother to keep meeting notes when you have a free transcript of everything said? It does, however, present some challenges to your retention policies, regulatory compliance, privacy obligations, discovery and more. So let’s dive into what this new document format stores and where it is stored, as well as how it may impact your holds, collections, processing and review.
First, a quick overview of how you create a transcript. Open online Word, expand the Dictate dropdown on the Home tab and select the Transcribe option to open the side menu. You can either start recording live or upload an audio or MP4 video file (the Teams/Zoom video format). When you stop a live recording, the file is processed in Azure, which returns a transcript delimited by speaker and time that you can add to your Word content. The audio/video file is stored in a new ‘Transcribed Files’ folder in your OneDrive. The inserted transcript starts with a link to the audio/video file, which is named ‘Your Recording.wav’ by default. Later recordings get incremented file names. The Word file is stored in your root OneDrive folder unless you use Save As.
So what is stored in the Word .DOCX file? I blogged about the format way back in 2010, but that post is on the legacy www.eDJGroupInc.com site. A .DOCX is a ZIP package, so you can just change the file extension to .zip to access the different .XML component files that Office now uses to assemble a document. Buried in the docProps folder we find a new custom.xml file that contains the transcript segments, each detailing the speaker number, transcribed text and start time. A segment is generated for every speaker change or pause. I have not yet tried to parse this into any of the common transcript load file formats, but I doubt that it would be hard for your providers to do.
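For anyone who would rather script the unzip-and-look step than rename files by hand, here is a minimal Python sketch. It only pulls the raw custom properties out of docProps/custom.xml using the standard Office custom-properties layout. I have not mapped how the transcript segments are laid out inside those values, so the function name `read_custom_properties` is my own and the returned text is left unparsed. Treat it as a starting point for inspection, not a transcript converter.

```python
import zipfile
import xml.etree.ElementTree as ET

CUSTOM_PROPS_PART = "docProps/custom.xml"
# Namespace of the standard Office custom-properties part.
CP_NS = "{http://schemas.openxmlformats.org/officeDocument/2006/custom-properties}"


def read_custom_properties(docx_path):
    """Return {property name: raw value text} from docProps/custom.xml.

    Returns an empty dict when the package has no custom properties part.
    The values are NOT interpreted: how transcript segments are encoded
    inside them is an open question that this sketch does not answer.
    """
    with zipfile.ZipFile(docx_path) as package:
        if CUSTOM_PROPS_PART not in package.namelist():
            return {}
        root = ET.fromstring(package.read(CUSTOM_PROPS_PART))

    properties = {}
    for prop in root.iter(CP_NS + "property"):
        # Each property wraps one typed value element (for example vt:lpwstr).
        value = next(iter(prop), None)
        text = value.text if value is not None and value.text else ""
        properties[prop.get("name")] = text
    return properties
```

Calling `read_custom_properties("Document.docx")` (a placeholder file name) on an exported transcript document and printing the keys is a quick way to see which properties the transcription feature added before you hand the file to a processing vendor.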
So how can you preserve or collect these new files? Theoretically, you can use the Office 365 Security & Compliance Core eDiscovery functionality to place holds on or collect from user OneDrives. See my afterword for the issues I encountered. So far I have been able to find and export the transcript but not the audio/video file. The export report does not have any hidden metadata listing the attached audio/video file.
So what are some potential issues?
- Disassociation of the audio/video from the transcript. You will want both. If you are using keywords or other criteria to selectively hold/collect from custodian OneDrives (see afterword note), you may get the transcript .DOCX file and leave the Your Recording.WAV file behind. Few if any collection tools for O365 will run down internal hyperlinks to collect linked files. The audio/video file properties do not contain any references to the .DOCX transcript file, and the created dates/times of the two files do not match.
- Speakers are numbered, but not identified. If you just have the transcript, how can you definitively attribute a critical statement? Few users are going to take the time to find/replace ‘Speaker 1’ with ‘Joe Smith’ when they have a personal memory of the meeting speakers.
- Possible privacy issues with background non-employees. Low risk, but stranger things have happened.
- Transcripts can be easily edited. This applies to those created via online Word and those created by Teams recordings that get saved to Stream with captioning enabled. I will dive into discovery on Teams/Stream soon.
- Many companies have rules/scripts that delete or assign short retention periods to .WAV/.MP4 files. In the past, most audio files on corporate file shares were employee music collections. Now that default OneDrive settings synchronize everything in the user document directory, we see lots of strange personal file types in custodial OneDrive collections. So there could be a serious risk that these files will be expired quickly or filtered out in your default processing.
- Processing security. I have not been able to find a Microsoft document detailing where the files are cached during processing. While I imagine it is securely encrypted, I like to check every box on security.
- Review presentation. If you suddenly get hundreds of transcripts and audio/video files, how will they be presented in review and will it drive up your counsel costs?
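On the disassociation risk at the top of this list, one way to run down the link yourself is to read the hyperlink targets straight out of the collected .DOCX. The sketch below is a rough Python illustration under a big assumption: that the transcript's link to the recording is stored as an ordinary external hyperlink relationship in word/_rels/document.xml.rels, which is where Word normally keeps hyperlinks. I have not confirmed that for transcript documents. The function names and the extension list are mine, and a OneDrive sharing link may not end in a media extension at all, so an empty result proves nothing.

```python
import zipfile
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit

RELS_PART = "word/_rels/document.xml.rels"
REL_NS = "{http://schemas.openxmlformats.org/package/2006/relationships}"
HYPERLINK_TYPE = (
    "http://schemas.openxmlformats.org/officeDocument/2006/relationships/hyperlink"
)
# Assumption: extensions worth flagging as possible recordings.
MEDIA_EXTENSIONS = (".wav", ".mp4", ".m4a", ".mp3")


def external_hyperlinks(docx_path):
    """Return the targets of all external hyperlinks in the main document part."""
    with zipfile.ZipFile(docx_path) as package:
        if RELS_PART not in package.namelist():
            return []
        root = ET.fromstring(package.read(RELS_PART))
    return [
        rel.get("Target")
        for rel in root.iter(REL_NS + "Relationship")
        if rel.get("Type") == HYPERLINK_TYPE
        and rel.get("TargetMode") == "External"
    ]


def likely_media_links(targets):
    """Keep only targets whose URL path ends in a known audio/video extension."""
    return [
        target
        for target in targets
        if urlsplit(target).path.lower().endswith(MEDIA_EXTENSIONS)
    ]
```

Running `likely_media_links(external_hyperlinks(path))` over each collected transcript gives you a list of recordings to go back and request, which is a manual stand-in for the link-following that most collection tools do not do.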
These are just some initial thoughts. As a side note, Teams already has a similar automated captioning feature for recordings made by the host and stored in Stream. We have moved so quickly into majority remote work-from-home with the pandemic that I feel like we are in for some surprises when the flood of pent-up litigation hits us. Have you encountered Zoom/Teams audio/video files yet? Do your product or processing services have a solution for them? Let me know in a comment or a private email.
*Afterword – During my testing I encountered issues wherein Compliance eDiscovery searches failed to return hits on the new transcript files. Broader searches failed to return ANY OneDrive files. I opened a support ticket and spent a couple hours with Microsoft tech confirming this behavior. They referred the issue to backline tech and promised to call back with an answer. The next morning the same date search returned hits on the transcript .DOCX file, but not the audio .WAV file from the new subfolder. Even full exports of my SharePoint and OneDrive environments have not included the new audio/video files. I have asked a few clients to run this test in their Office 365 environments. It is too early to have any firm conclusions, but this kind of behavior keeps me running acceptance testing and periodic function testing on all cloud systems.
Greg Buckles wants your feedback, questions or project inquiries at Greg@eDJGroupInc.com. Contact him directly for a free 15 minute ‘Good Karma’ call. He solves problems and creates eDiscovery solutions for enterprise and law firm clients.
Quick update on the Compliance eDiscovery search issues encountered. Follow-up testing and another couple of support sessions seem to confirm that the audio/video files stored in the ‘Transcribed Files’ folder are NOT searchable through the eDiscovery UI. I have little doubt that a PowerShell cmdlet with proper syntax might retrieve these kinds of items. Most of my global corporate clients severely restrict PS access for good reasons. I have reproduced this issue in one of my client’s environments, and the support team indicated that they were going to start a feedback process to development. My big recommendation is to TEST,…
Update 2: A friend recreated my tests and successfully found the OneDrive files using Core and Advanced eDiscovery searches. It is clear that my O365 tenant has index issues. Once MS Support has resolved the issues I will post an update on how to identify/resolve the issues if they occur in your environment.