Thursday, November 30, 2017

Extracting/Downloading Attachments inside InfoPath browser forms

SITUATION: It's been a while but I'm back to working more with SharePoint - YAY - and was hit with my first unique request: we want to find a way to extract files that are inside of InfoPath attachment controls.

RESOLUTION: I apologize now: THERE IS NO OOTB WAY TO JUST GRAB THESE SUCKERS...gotta use SOME form of code or script :(. So, I figured out two ways to do this: a powershell way, for the SharePoint admins that can access the server directly, and a javascript way that works for anyone who has access to the form libraries.

******CAVEATS******
The ways presented are definitely a beta and should be applied only to a test instance of SharePoint so you don't kill your server or computer. I'm definitely not responsible for you using this script and it messing stuff up - review it, test it, follow your company's standards for verification and testing, etc.
I welcome feedback if you discover a bug in it but feel free to use how you see fit, tweak it, whatever. It'd be nice to keep some credit if you end up doing amazing things with it - just let me know :)!
This was tested using Chrome against a SharePoint Server 2010 instance. This may or may not work in other versions, I'm sure you could test it and let me know.
******END CAVEATS******

Now, it took me a while to get all the pieces together, so here are the links to the information needed to pull this off <begin credits>:
  1. MS explains how to do this in visual studio code here
  2. Chris White expounds on the same thing in some different detail here
  3. Different stack exchange questions related to pushing and pulling from XML in a form library as well as handling the attachment raw data and rebuilding it here, here, and here
  4. Masanao Izumo has a cross-browser implementation for converting an unsigned 8 byte integer array back to a string (text) here <end credits>
Powershell Script (skip to the bottom to see explanation of how this all came together). Has 4 requirements for which it prompts: 1) URL of the sharepoint web (aka the regular site, not site collection), 2) The name of the form library, 3) which data sources in the InfoPath form have attachments in them that you want to extract - you supply them in a single textbox separated by commas (e.g. Attachment,Photo,form_optional_attachment), and 4) which node has a value that you want to use to make folders on the local computer for all of the attachments it finds inside that file. The last part that you *might* edit is the $filepath since that is the main path on your computer for where you want to stick all these attachment files. I have intentionally left a test loop in there that will stop after 2 xml files are processed - just remove the lines that have the comment about "TEST ITERATION LOOP" if you want to run it on every file in a library. Here's the script:

<# SCRIPT TO EXTRACT INFOPATH ATTACHMENTS FROM WITHIN XML FILES ON SHAREPOINT
REQUIRES YOU TO SUPPLY THE URL, LIBRARY NAME, DATA SOURCES THAT HOLD ATTACHMENTS YOU WANT, AND AN ORGANIZING DATA SOURCE FOR THE FOLDERS
THIS SCRIPT WILL COPY THE ATTACHMENTS ONTO YOUR LOCAL MACHINE #>
if ((Get-PSSnapin "Microsoft.SharePoint.PowerShell" -ErrorAction SilentlyContinue) -eq $null) 
{
    Add-PSSnapin "Microsoft.SharePoint.PowerShell"
}
function ReadXMLFromLibrary{
 #Gather and set parameters
 $siteurl = read-host "What is the url to the site in question?"
 $libraryname = read-host "What is the form library name?"
 $attachmentNodes = read-host "Please enter the names of the data sources, separated ONLY by commas, where attachment controls exist (not case sensitive)"
 $folderStructureNode = read-host "Please enter the name of the primary data source that will be used to create folders for each form attachments (not case sensitive)"
 $folderStructureNode = $folderStructureNode.ToLower()
 $arrAttachmentNodes = $attachmentNodes.ToLower().split(",")
 $filepath = "C:\InfoPathExtractions\" #filepath is the folder where you want the attachments to be dumped, LEAVE THE TRAILING BACKSLASH
 $web = Get-SPWeb -Identity $siteurl
 $myFiles = $web.Folders[$libraryname]
 $temp = 0 #TEST ITERATION LOOP, REMOVE ALL TEST ITERATION LOOP LINES TO TRY EVERY DOCUMENT IN THE LIBRARY INSTEAD OF ONLY 2
 foreach($file in $myFiles.Files){
  write-host $file.name is being processed
  if($temp -lt 2){ #TEST ITERATION LOOP, REMOVE ALL TEST ITERATION LOOP LINES TO TRY EVERY DOCUMENT IN THE LIBRARY INSTEAD OF ONLY 2
   $document = $libraryname + "/" + $file.Name
   $filedata = $web.GetFile($document)
   $data = $filedata.OpenBinary()
   $encode = New-Object System.Text.ASCIIEncoding
   $temp++ #TEST ITERATION LOOP, REMOVE ALL TEST ITERATION LOOP LINES TO TRY EVERY DOCUMENT IN THE LIBRARY INSTEAD OF ONLY 2
   if($encode.GetString($data[0]) -eq "?"){ #Some of our XML files have 3 illegal characters at the beginning, this solves that and converts binary to XML file
    $xml = [xml]($encode.GetString($data).substring(3))
   } else {
    $xml = [xml]($encode.GetString($data))
   }
   foreach($node in $arrAttachmentNodes){ #These are the nodes specified above and will check the XML file for that node within the node it uses to contain all the actual InfoPath data sources, checks if it has a value, then tries to recreate the original attachment file
    $myNodes = $xml.SelectNodes("//*[translate(local-name(),'ABCDEFGHIJKLMNOPQRSTUVWXYZ','abcdefghijklmnopqrstuvwxyz')='$node']")
    for($j=0;$j -lt $myNodes.Count;$j++){
     $b64 = $myNodes.Item($j) | select-object -ExpandProperty "#text" -ErrorAction SilentlyContinue
     $bytes = [Convert]::FromBase64String($b64)
     if($bytes.length -gt 0){
      $arrFileNameBytes = @()
      $fileNameByteLen = $bytes[20]*2 #When the attachment is broken into byte strings, the 20th byte tells you how many bytes are used for the filename. Multiply by 2 for ASCII encoding
      for($i=0;$i -lt $fileNameByteLen-2;$i+=2){ #Recreate the filename using every other char from filename bytes
       $arrFileNameBytes+=$bytes[24+$i]
      }
      $arrFileContentBytes = @()
      $fileContentByteLen = $bytes.length-(24+$fileNameByteLen) #Determine content length by Total - Header - Filename
      $arrFileContentBytes = $bytes[(24+$fileNameByteLen)..($fileContentByteLen+24+$fileNameByteLen)] #Create new array by cloning the content bytes into new array
      $fileName = [System.Text.Encoding]::ASCII.GetString($arrFileNameBytes)
      $fileName = $fileName.trim()
      if(($fileName.indexOf(".",$fileName.length - 5)) -eq -1){ $FileName = $FileName.pdf}
      $folderNameNode = $xml.SelectSingleNode("//*[translate(local-name(),'ABCDEFGHIJKLMNOPQRSTUVWXYZ','abcdefghijklmnopqrstuvwxyz')='$folderStructureNode']")
      $folderName = $folderNameNode | Select-Object -ExpandProperty "#text"
      $folderName +="/"
      if(!(test-path $filepath+$folderName -PathType Container)){
       New-Item -ItemType Directory -Force -Path $filepath$folderName
      }
      [IO.File]::WriteAllBytes($filepath+$folderName+$fileName,$arrFileContentBytes)
     }
    }
   }
  }
 }
 $web.dispose()
}
ReadXMLFromLibrary

Javascript version (for all you regular people out there ^_^, skip to the bottom to see explanation of how this all came together). This one only needs two things: the URL to the form library and a comma-separated string of text for each of the files you want to access (e.g. filename1.xml,filename7.xml,myForm2017-11-30T11_01_01.xml). You can get that list of names by creating a view in for the library that shows just the "Name(for use in forms)" column and exporting that to Excel. I usually just make sure all of them end up in cells next to each other and save as a CSV. Here's the script:

/* SCRIPT TO EXTRACT INFOPATH ATTACHMENTS FROM WITHIN INFOPATH XML FILES
** AND APPEND A LINK TO DOWNLOAD THEM TO A WEBPAGE
** Credit: Base64 to array from https://stackoverflow.com/questions/21797299/convert-base64-string-to-arraybuffer
** Credit: UTF8 Array to Str from https://ourcodeworld.com/articles/read/164/how-to-convert-an-uint8array-to-string-in-javascript (Masanao Izumo is the author)
** Created by: Brendan Horner www.hornerit.com 11/29/2017
** You can use this in the console if you load up any SharePoint page - possibly even throw this on an empty page in SharePoint.
*/
//These initial functions load a few utility functions for converting a base64 string of numbers into a file
function base64ToArray(base64) { var binary_string = window.atob(base64); var len = binary_string.length; var bytes = new Uint8Array( len ); for (var i = 0; i < len; i++){ bytes[i] = binary_string.charCodeAt(i); } return bytes; }
function Utf8ArrayToStr(array) { var out; var i; var len; var c; var char2; var char3; out = ""; len = array.length; i = 0; while (i < len) { c = array[i++]; switch (c >> 4) {case 0: case 1: case 2: case 3: case 4: case 5: case 6: case 7: out += String.fromCharCode(c); break; case 12: case 13: char2 = array[i++]; out += String.fromCharCode(((c & 0x1F) << 6) | (char2 & 0x3F)); break; case 14: char2 = array[i++]; char3 = array[i++]; out += String.fromCharCode(((c & 0x0F) << 12) | ((char2 & 0x3F) << 6) | ((char3 & 0x3F) << 0)); break;}} return out;}
function isBase64(str) { try { return btoa(atob(str)) === str;} catch (err) { return false;}}

//This is the function for when the retrieval was successful of the XML file
function successHandler(data) {
    var myID = Math.floor((Math.random() * 100000000) + 1);
    var newHTML = "";
    jQuery(data).children().find("*").each(function () {
        //This will loop through every element in the XML file and this next line checks to see if it's a giant string of text and is encoded using Base64 (used for files)
        if (jQuery(this).text().length > 5000 && isBase64(jQuery(this).text())) {
            //Decodes the Base64 string so we can extract certain pieces. The resulting array is a 24-byte header + Filename bytes + File content bytes
            var bytes = base64ToArray(jQuery(this).text());
            var arrFileNameBytes = [];
            //The 20th byte in these InfoPath attachments header contains just how long the filename is supposed to be; multiply by 2 for Unicode encoding (weird chars in between each letter)
            var fileNameByteLen = bytes[20] * 2;
            var i = 0;
            //Grab every other character of the filename array (should just be the actual usable characters) except the last one as it is always a \0
            while (i < fileNameByteLen - 2) { arrFileNameBytes.push(bytes[24 + i]); i += 2;}
            //Now that we have an array of bytes for the filename, convert from 2-digit codes to actual letters/numbers/symbols
            var fileName = Utf8ArrayToStr(arrFileNameBytes);
            //In case there's weird spaces at the beginning or end of filename, remove them
            fileName = fileName.trim();
   fileName = this.nodeName.substr(3) + "-" + fileName;
            //I've found in our implementation that some PDFs don't appear to have their file extension listed like others in filename, so, if there is no extension in the filename, add .PDF
            if (fileName.slice(-(5)).indexOf(".") === -1) { fileName += ".pdf";}
            var fileContentByteLenStart = 24 + fileNameByteLen;
            //Create another array with JUST the content of the file by starting 24 chars (header) + filename length in bytes into the byte array
            var arrFileContentBytes = bytes.slice(fileContentByteLenStart);
            //Convert byte array into a Blob (Binary Large OBject) - basically it reconstructs the file
            var myBlob = new Blob([arrFileContentBytes]);
            var myBlobUrl = URL.createObjectURL(myBlob);
            //This next section just creates a link on the webpage to download the file
            var myLink = "<a href='" + myBlobUrl + "' download='" + fileName + "'>Click here to download " + fileName + " | </a>";
            newHTML += myLink;
        }
    });
    //If there were any attachments then append download links to the html page where you placed this script
    if (newHTML.length > 1) {
        newHTML = "<div id='" + myID + "' style='display:block;clear:both;border: 1px solid black'>" + newHTML + "</div>";
        jQuery("#s4-mainarea").append(newHTML);
    }
}

function errorHandler(data) {
    console.log("Request failed: " + data);
}
function myMain() {
    "use strict";
    //These next variables ask for the URL to the library and a string of filenames to process (so you can limit how many you do at once) and splits that list for processing
    var libraryUrl = prompt("What is the url to the library in question (DO NOT INCLUDE TRAILING SLASH)?");
    var fileNames = prompt("Please give a list of filenames to process (including extension like .xml for infopath) separated only by commas");
    var arrFileNames = fileNames.split(",");
    var i = arrFileNames.length;
    //Now begins the main looping through each file, attempting to retrieve the infopath XML file, and scanning it if it successfully retrieved it
    while (i--) {
        jQuery.ajax({
            url: libraryUrl + "/" + arrFileNames[i] + "?Noredirect=true",
            type: "GET"
        }).done(Function.createDelegate(this, successHandler)).error(Function.createDelegate(this, errorHandler));
    }
}
//Loads jQuery if needed and then runs myMain function
"use strict";
if (typeof jQuery === "undefined") { (function (e, s) { e.src = s; e.onload = function () {jQuery.noConflict(); myMain();}; document.head.appendChild(e);})(document.createElement("script"), "//code.jquery.com/jquery-latest.min.js");} else { myMain();}

Further Explanation: So here's how it pulls together in either script:

  1. InfoPath XML attachment files are first a long string that has been "encoded" into something called Base64. You don't have to know what that means, you just have to know they were encoded - which means to use them we need to decode them. That's the first step once we have the InfoPath XML file.
  2. The attachment files (once decoded) are composed of a header portion and the data portion. The header consists of 2 parts: a fixed header set of information and the filename of the attachment.
  3. Byte 20 of the header tells you just how long the filename is as a number. This is a byte but needs to be doubled for how the information is encoded (unicode or UTF8).
  4. Once we know how long the filename is, we look *just* past the end of the regular, fixed header (byte 24) and copy every other byte till the end of the filename section - except for the very last one as it has a byte to tell the system that this is the end of the file name. The reason we do every other is that the ones in between seem to not be letters or characters...probably some hidden characters of some sort. One weird thing - PDFs in my environment didn't have a file extension at the end of their name...so I created a small workaround in the javascript.
  5. Now that we have the filename, we need to separate out the contents from the header, so we create a new array that just starts after the end of the filename.
  6. This new array needs to be somehow converted into a file...in Powershell, we use the WriteAllBytes function to make it a file and we just tell it where the path is that it should write and give it this blob of goodness; in Javascript, this new array is converted into a blob (binary large object)
  7.  Getting access to the file itself is easy for Powershell - when the script is done, go look at your folder; for Javascript, it creates a block on the bottom of your SharePoint page for each file that had attachments and sticks them all in there along with the name of the node from which the attachment came.