RESOLUTION: I apologize now: THERE IS NO OOTB WAY TO JUST GRAB THESE SUCKERS...you must use SOME form of code or script :(. So, I figured out two ways originally and have maintained one way to do this: PowerShell. This way, you can download the files locally and then extract the attachments (does use more disk space) but is probably done on a SharePoint admin server - I welcome some assistance in porting this to O365 so that anyone can run this on their own computer if they have appropriate perms
******CAVEATS******
The ways presented are definitely a beta and should be applied only to a test instance of SharePoint so you don't kill your server or computer. I'm definitely not responsible for you using this script and it messing stuff up - review it, test it, follow your company's standards for verification and testing, etc.
I welcome feedback if you discover a bug in it but feel free to use how you see fit, tweak it, whatever. It'd be nice to keep some credit if you end up doing amazing things with it - just let me know :)!
This was tested using PowerShell on a SharePoint Server 2010 instance. This may or may not work in other versions, I'm sure you could test it and let me know.
******END CAVEATS******
Now, it took me a while to get all the pieces together, so here are the links to the information needed to pull this off <begin credits>:
Further Explanation: So here's how it pulls together:
- MS explains how to do this in visual studio code here
- Chris White expounds on the same thing in some different detail here
- Different stack exchange questions related to pushing and pulling from XML in a form library as well as handling the attachment raw data and rebuilding it here, here, and here
- Masanao Izumo has a cross-browser implementation for converting an unsigned 8 byte integer array back to a string (text) here
- (2022 update) - MS has some documentation on the byte header that they insert before files that have been embedded within their xml files here <end credits>
Further Explanation: So here's how it pulls together:
- InfoPath XML attachment files are first a long string that has been "encoded" into something called Base64. You don't have to know what that means, you just have to know they were encoded - which means to use them we need to decode them. That's the first step once we have the InfoPath XML file.
- The attachment files (once decoded) are composed of a header portion and the data portion. The header consists of 2 parts: a fixed header set of information and the filename of the attachment.
- Byte 20 of the header tells you just how long the filename is as a number. This is a byte but needs to be doubled for how the information is encoded (unicode or UTF8).
- Once we know how long the filename is, we look *just* past the end of the regular, fixed header (byte 24) and copy every other byte till the end of the filename section - except for the very last one as it has a byte to tell the system that this is the end of the file name. The reason we do every other is that the ones in between are basically nothing or what programmers call 'null'. One weird thing - PDFs in my environment didn't have a file extension at the end of their name...so I created a small workaround to just append .pdf if there was not any extension mentioned.
- Now that we have the filename, we need to separate out the contents from the header, so we create a new array that just starts after the end of the filename.
- This new array needs to be somehow converted into a file...in *Windows* Powershell, we use the WriteAllBytes function to make it a file and we just tell it where the path is that it should write and give it this blob of goodness
- Getting access to the file itself is easy for Powershell - when the script is done, go look at your folder. There's the "_DownloadedRawFiles" subfolder, which you can ignore for now, and there's all the other folders generated by the extraction process based on the group by aka Folder Structure Node supplied.