Thursday, November 30, 2017

Extract Attachments Inside or Embedded Within InfoPath XML Forms

SITUATION: It's been a while but I'm back to working more with SharePoint - YAY - and was hit with my first unique request back in 2018: we want to find a way to extract files that are embedded inside of InfoPath attachment controls. Along the way, also had to figure out how to download all files in a library.

RESOLUTION: I apologize now: THERE IS NO OOTB WAY TO JUST GRAB THESE SUCKERS...you must use SOME form of code or script :(. So, I figured out two ways originally and have maintained one way to do this: PowerShell. This way, you can download the files locally and then extract the attachments (does use more disk space) but is probably done on a SharePoint admin server - I welcome some assistance in porting this to O365 so that anyone can run this on their own computer if they have appropriate perms

******CAVEATS******
The ways presented are definitely a beta and should be applied only to a test instance of SharePoint so you don't kill your server or computer. I'm definitely not responsible for you using this script and it messing stuff up - review it, test it, follow your company's standards for verification and testing, etc.
I welcome feedback if you discover a bug in it but feel free to use how you see fit, tweak it, whatever. It'd be nice to keep some credit if you end up doing amazing things with it - just let me know :)!
This was tested using PowerShell on a SharePoint Server 2010 instance. This may or may not work in other versions, I'm sure you could test it and let me know.
******END CAVEATS******

Now, it took me a while to get all the pieces together, so here are the links to the information needed to pull this off <begin credits>:
  1. MS explains how to do this in visual studio code here
  2. Chris White expounds on the same thing in some different detail here
  3. Different stack exchange questions related to pushing and pulling from XML in a form library as well as handling the attachment raw data and rebuilding it here, here, and here
  4. Masanao Izumo has a cross-browser implementation for converting an unsigned 8 byte integer array back to a string (text) here
  5. (2022 update) - MS has some documentation on the byte header that they insert before files that have been embedded within their xml files here <end credits>
Powershell Script - Most recent update was 2022-05-02. Has 2 requirements and 3 optional pieces for which it prompts: 1) URL of the sharepoint web (aka the regular site, not site collection), 2) The name of the form library, and 3) if you are extracting InfoPath forms, what data source in the xml file do you want to use to create folder(s) on your computer in which to place all of the attachments - if it is an XML attribute instead of a proper data source then that is ok too, and 4) and 5) a cutoff date in case you want to only download items before a certain date or after a certain date. The last part that you *might* edit is the $filepath since that is the main path on your computer for where you want to stick all these attachment files. Here's the script link: https://github.com/hornerit/powershell/blob/master/Get-SharePointInfoPathFilesAndAttachments.ps1

Further Explanation: So here's how it pulls together:

  1. InfoPath XML attachment files are first a long string that has been "encoded" into something called Base64. You don't have to know what that means, you just have to know they were encoded - which means to use them we need to decode them. That's the first step once we have the InfoPath XML file.
  2. The attachment files (once decoded) are composed of a header portion and the data portion. The header consists of 2 parts: a fixed header set of information and the filename of the attachment.
  3. Byte 20 of the header tells you just how long the filename is as a number. This is a byte but needs to be doubled for how the information is encoded (unicode or UTF8).
  4. Once we know how long the filename is, we look *just* past the end of the regular, fixed header (byte 24) and copy every other byte till the end of the filename section - except for the very last one as it has a byte to tell the system that this is the end of the file name. The reason we do every other is that the ones in between are basically nothing or what programmers call 'null'. One weird thing - PDFs in my environment didn't have a file extension at the end of their name...so I created a small workaround to just append .pdf if there was not any extension mentioned.
  5. Now that we have the filename, we need to separate out the contents from the header, so we create a new array that just starts after the end of the filename.
  6. This new array needs to be somehow converted into a file...in *Windows* Powershell, we use the WriteAllBytes function to make it a file and we just tell it where the path is that it should write and give it this blob of goodness
  7. Getting access to the file itself is easy for Powershell - when the script is done, go look at your folder. There's the "_DownloadedRawFiles" subfolder, which you can ignore for now, and there's all the other folders generated by the extraction process based on the group by aka Folder Structure Node supplied.