Thursday, November 30, 2017

Extract Attachments Inside or Embedded Within InfoPath XML Forms

SITUATION: It's been a while but I'm back to working more with SharePoint - YAY - and was hit with my first unique request back in 2018: we want to find a way to extract files that are embedded inside of InfoPath attachment controls. Along the way, also had to figure out how to download all files in a library.

RESOLUTION: I apologize now: THERE IS NO OOTB WAY TO JUST GRAB THESE SUCKERS...you must use SOME form of code or script :(. So, I figured out two ways originally and have maintained one way to do this: PowerShell. This way, you can download the files locally and then extract the attachments (does use more disk space) but is probably done on a SharePoint admin server - I welcome some assistance in porting this to O365 so that anyone can run this on their own computer if they have appropriate perms

******CAVEATS******
The ways presented are definitely a beta and should be applied only to a test instance of SharePoint so you don't kill your server or computer. I'm definitely not responsible for you using this script and it messing stuff up - review it, test it, follow your company's standards for verification and testing, etc.
I welcome feedback if you discover a bug in it but feel free to use how you see fit, tweak it, whatever. It'd be nice to keep some credit if you end up doing amazing things with it - just let me know :)!
This was tested using PowerShell on a SharePoint Server 2010 instance. This may or may not work in other versions, I'm sure you could test it and let me know.
******END CAVEATS******

Now, it took me a while to get all the pieces together, so here are the links to the information needed to pull this off <begin credits>:
  1. MS explains how to do this in visual studio code here
  2. Chris White expounds on the same thing in some different detail here
  3. Different stack exchange questions related to pushing and pulling from XML in a form library as well as handling the attachment raw data and rebuilding it here, here, and here
  4. Masanao Izumo has a cross-browser implementation for converting an unsigned 8 byte integer array back to a string (text) here
  5. (2022 update) - MS has some documentation on the byte header that they insert before files that have been embedded within their xml files here <end credits>
Powershell Script - Most recent update was 2022-05-02. Has 2 requirements and 3 optional pieces for which it prompts: 1) URL of the sharepoint web (aka the regular site, not site collection), 2) The name of the form library, and 3) if you are extracting InfoPath forms, what data source in the xml file do you want to use to create folder(s) on your computer in which to place all of the attachments - if it is an XML attribute instead of a proper data source then that is ok too, and 4) and 5) a cutoff date in case you want to only download items before a certain date or after a certain date. The last part that you *might* edit is the $filepath since that is the main path on your computer for where you want to stick all these attachment files. Here's the script link: https://github.com/hornerit/powershell/blob/master/Get-SharePointInfoPathFilesAndAttachments.ps1

Further Explanation: So here's how it pulls together:

  1. InfoPath XML attachment files are first a long string that has been "encoded" into something called Base64. You don't have to know what that means, you just have to know they were encoded - which means to use them we need to decode them. That's the first step once we have the InfoPath XML file.
  2. The attachment files (once decoded) are composed of a header portion and the data portion. The header consists of 2 parts: a fixed header set of information and the filename of the attachment.
  3. Byte 20 of the header tells you just how long the filename is as a number. This is a byte but needs to be doubled for how the information is encoded (unicode or UTF8).
  4. Once we know how long the filename is, we look *just* past the end of the regular, fixed header (byte 24) and copy every other byte till the end of the filename section - except for the very last one as it has a byte to tell the system that this is the end of the file name. The reason we do every other is that the ones in between are basically nothing or what programmers call 'null'. One weird thing - PDFs in my environment didn't have a file extension at the end of their name...so I created a small workaround to just append .pdf if there was not any extension mentioned.
  5. Now that we have the filename, we need to separate out the contents from the header, so we create a new array that just starts after the end of the filename.
  6. This new array needs to be somehow converted into a file...in *Windows* Powershell, we use the WriteAllBytes function to make it a file and we just tell it where the path is that it should write and give it this blob of goodness
  7. Getting access to the file itself is easy for Powershell - when the script is done, go look at your folder. There's the "_DownloadedRawFiles" subfolder, which you can ignore for now, and there's all the other folders generated by the extraction process based on the group by aka Folder Structure Node supplied.

10 comments:

  1. Thank you man! That saved me quite a lot of headaches.
    I've adapted your code into a Vue.js app to extract old InfoPath forms' attachments. Should you (or anyone) need it, just ask.

    ReplyDelete
  2. Works like a charm Brendan! Thank you sooooo much!
    Moving from on prem SP2013 with Infopath to SP online without...

    ReplyDelete
  3. Hi, this could really help me on migrating InfoPath forms and specially the attachments. But, an error is displayed when the attachments are processed: Cannot convert value "System.Object[]" to type "System.Xml.XmlDocument". Error "The '¤' character, hexadecimal value 0xA4, cannot be included in a name. Line 33, position 14." Could you provide some ideas on this error and how to fix it?

    ReplyDelete
    Replies
    1. So this happened before with the SS symbol for subsection in a legal document. If you look at line 410, you can add your own .replace("symbol goes here","") to it and I think that'll fix that strange symbol from being included in the name and that *should* allow it to work.

      Delete
    2. Thank you for your reply. I changed the replace character like this [xml]$xml = (Get-Content $file).Replace("¤","") but in error message output is now [xml]$xml = (Get-Content $file).Replace("¤","") I copy-pasted the character from character map but also via notepad.Both cases produced same message. Obviously something is added via clipboard.

      Delete
    3. Try using the alt input method for the symbol (holding alt and typing the 4 digit code on numeric keypad). If that doesn't work, verify that that is the right place by looking at the line in the error code and let me know. Worst case: if u have a fake one that u can send me the xml file, I can tinker and see if there are other areas needing to be addressed for that special character.

      Delete
  4. Hi, and thank you for your time. Alt numpad fixed the character issue, but now the message is about another character, the embedded attachments are pdf files (one per xml file): Cannot convert value "System.Object[]" to type "System.Xml.XmlDocument". Error: "The '¶' character, hexadecimal value 0xB6, cannot be included in a name. Line 98, position
    19."
    At D:\temp\Get-SharePointInfoPathFilesAndAttachments.ps1:410 char:4
    + [xml]$xml = (Get-Content $file).Replace("¤","")
    + ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo : InvalidArgument: (:) [], RuntimeException
    + FullyQualifiedErrorId : InvalidCastToXmlDocument

    ReplyDelete
    Replies
    1. You can actually chain the replace commands like .replace("something","").replace("something else","") and just list any special characters that it doesn't support. There are a few more efficient solutions but I would be afraid that they would break functionality at this point.

      Delete
    2. ok, thanks, it helped. I run the script to a lot of 24392 files and here is the error stats: Error stats: 0 attachments failed to be extracted from 0 files
      Extraction stats:
      Total attachments extracted: 2896 (from appx 24392 InfoPath source files)
      Total time to extract data/attachments: 1576.6709125 seconds
      Please press enter to close:

      Regarding the console output, I get 90 messages like Cannot convert value "System.Object[]" to type "System.Xml.XmlDocument". Error: "The 'my:myFields' start tag on line 1 position 42
      5 does not match the end tag of 'my:attribute'. Line 3, position 21."
      At D:\temp\Get-SharePointInfoPathFilesAndAttachments.ps1:410 char:4
      + [xml]$xml = (Get-Content $file).Replace("¤","").Replace(" ...
      + ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      + CategoryInfo : InvalidArgument: (:) [], RuntimeException
      + FullyQualifiedErrorId : InvalidCastToXmlDocument

      Could I put a line somewhere in script to output the file names which cause the messages?

      Good job with the script!

      Delete
    3. I went ahead and added some error handling on those lines in the script to at least output which file path it was that had an error. I didn't implement any sort of error output file but you'll have an error message and proper error counting now when it fails to convert certain files to XML. Take a look at the updated script and you'll see around line 410 the new try/catch blocks that will handle it and you can just re-add your replace commands.

      Delete