Thursday, November 30, 2017

Extracting/Downloading Attachments Inside InfoPath Forms or Download All Files in a SharePoint Library

SITUATION: It's been a while but I'm back to working more with SharePoint - YAY - and was hit with my first unique request: we want to find a way to extract files that are inside of InfoPath attachment controls. **EDIT 2/1/2018** Along the way, also had to figure out how to download all files in a library.

RESOLUTION: I apologize now: THERE IS NO OOTB WAY TO JUST GRAB THESE SUCKERS...gotta use SOME form of code or script :(. So, I figured out two ways to do this: a PowerShell way that admins can use to download the files locally and then extract the attachments (it does use more disk space) and a JavaScript way that works for anyone who has access to the form libraries but requires that you download the attachments separately (not as much fun).

******BEGIN CAVEATS******
The approaches presented are definitely beta and should be applied only to a test instance of SharePoint so you don't kill your server or computer. I'm definitely not responsible for you using this script and it messing stuff up - review it, test it, follow your company's standards for verification and testing, etc.
I welcome feedback if you discover a bug in it but feel free to use how you see fit, tweak it, whatever. It'd be nice to keep some credit if you end up doing amazing things with it - just let me know :)!
This was tested using Chrome against a SharePoint Server 2010 instance. This may or may not work in other versions, I'm sure you could test it and let me know.
******END CAVEATS******

Now, it took me a while to get all the pieces together, so here are the links to the information needed to pull this off <begin credits>:
  1. Microsoft explains how to do this with Visual Studio code here
  2. Chris White expounds on the same thing in some different detail here
  3. Various Stack Exchange questions related to pushing and pulling from XML in a form library as well as handling the raw attachment data and rebuilding it here, here, and here
  4. Masanao Izumo has a cross-browser implementation for converting an unsigned 8-bit integer array back to a string (text) here <end credits>
PowerShell Script (skip to the bottom to see an explanation of how this all came together). It prompts for two required and two optional pieces of information:
  1. URL of the SharePoint web (aka the regular site, not the site collection)
  2. The name of the form library
  3. (Optional) If you are downloading InfoPath forms, which data source in the XML file to use to create a folder on your computer in which to place all of the attachments
  4. (Optional) A cutoff date in case you only want to download items before a certain date
The last part that you *might* edit is the $filepath since that is the main path on your computer for where you want to stick all these attachment files. Here's the script:

***EDIT 2/1/2018*** Updated PowerShell script to download the files THEN extract their contents as well as provide a progress bar and added the cutoff date. This can be used to download all files in a library as well and it'll ignore any non-infopath forms when it tries to extract attachments
if ((Get-PSSnapin "Microsoft.SharePoint.PowerShell" -ErrorAction SilentlyContinue) -eq $null) {
    Add-PSSnapin "Microsoft.SharePoint.PowerShell"
}

function GetFilesAndAttachmentsFromLibrary{
 <#Gather and set parameters#>
 $siteurl = read-host "What is the url to the site in question?"
 $siteurl = $siteurl.trim().trimend("/\")
 $libraryname = read-host "What is the library name?"
 $libraryname = $libraryname.trim().trimend("/\")
 $folderStructureNode = read-host "If InfoPath forms found, what data source should be used to create folders?"
 $strCutOffDate = read-host "Please enter the last date you wish to archive, leave empty for all dates - we will process till Modified date/time is 11:59:59 of cutoff day"
 $cred = get-credential
 <#Create an internet browser object for downloading and set the authentication information for it#>
 $webclient = New-Object System.Net.WebClient
 $webclient.Credentials = $cred
 <#Check if the cutoff date is specified or a folder structure node to create folders for embedded attachments within XML files#>
 if($strCutOffDate -ne ""){ $dtCutOffDate = (get-date -date "$strCutOffDate 11:59:59 PM") } else { $dtCutOffDate = "" }
 if($folderStructureNode -ne ""){ $folderStructureNode = $folderStructureNode.ToLower() }
 <#Create local paths for storing the downloaded files and their attachments, change filepath1 to your liking but leave the $libraryname as an easy folder name#>
 $filepath1 = "C:\SPScripts\$libraryname\".replace("/","")
 $filepath2 = $filepath1+"_DownloadedRawFiles\"
 <#Get SharePoint web aka website, then get the library in question#>
 try {
  $web = Get-SPWeb -Identity $siteurl -ErrorAction SilentlyContinue
  if($web -eq $null -or $web -eq ""){ throw }
 } catch { write-error "Failed to obtain the website, possibly bad url or bad credentials" -EA Stop }
 try {
  $list = $web.lists[$libraryname]
  if($list -eq $null){ throw }
 } catch { write-error "Failed to obtain list, possibly bad name or URL, or the list is not actually in that site" -EA Stop }
 <#Assuming everything went well, try to create a folder locally for the file downloading#>
 if(!(test-path $filepath2 -PathType Container)){ New-Item -ItemType Directory -Force -Path $filepath2 | out-null }
 <#Create the query for only 1000 records from the list with only 3 fields of data to keep the query small#>
 $query = New-Object Microsoft.SharePoint.SPQuery
 $query.ViewAttributes = "Scope='Recursive'"
 $query.RowLimit = 1000
 $query.ViewFields = "<FieldRef Name='ID'/><FieldRef Name='LinkFilenameNoMenu'/><FieldRef Name='Last_x0020_Modified'/>"
 $query.ViewFieldsOnly = $true
 <#Looping logic - approximating the size of the library because a giant library would use all of your RAM before you could process it#>
 $loopCounter = 0
 $loopTotal = $list.itemcount
 $interval = [math]::Round($loopTotal/20)
 if($interval -lt 1){ $interval = 1 }
 <#Execute the query to get the list items, get the position in case there are more than 1000 items, loop through the files, show our progress, download each file#>
 do {
  $myFiles = $list.GetItems($query)
  $query.ListItemCollectionPosition = $myFiles.ListItemCollectionPosition
  foreach($file in $myFiles){
   if(($loopCounter % $interval) -eq 0){ write-progress -id 1 -activity "Step 1 of 2: Downloading Files" -status "Working on $loopCounter of appx $loopTotal" -percentComplete ($loopCounter/$loopTotal*100) }
   $loopCounter++
   if($dtCutOffDate -ne "" -and (Get-Date -date $file["Last_x0020_Modified"]) -gt $dtCutOffDate){ continue }
   $webclient.DownloadFile($siteurl + "/" + $file.Url + "?NoRedirect=true",$filepath2+$file.Name)
  }
 } while($query.ListItemCollectionPosition -ne $null)
 write-progress -id 1 -activity "Step 1 of 2: Downloading Files" -status "Completed" -Completed
 <#Clean up the web object to prevent memory leak#>
 $web.Dispose()
 <#Grab all xml (InfoPath) files in the download to process for embedded attachments; if there aren't any, we are done; if there are, find out how many and set loop info#>
 $myFiles = Get-ChildItem -Path "$filepath2\*" -Include "*.xml" -Recurse
 if ($myFiles -eq $null -or $myFiles.Count -eq 0){ return }
 $loopCounter = 0
 $loopTotal = $myFiles.count
 $interval = [math]::Round($loopTotal/20)
 if($interval -lt 1){ $interval = 1 }
 <#Begin processing files#>
 foreach($file in $myFiles){
  if(($loopCounter % $interval) -eq 0){ write-progress -id 1 -activity "Step 2 of 2: Extracting Attachments" -status "Working on $loopCounter of $loopTotal" -percentComplete ($loopCounter/$loopTotal*100) }
  [xml]$xml = Get-Content $file
  $myNodes = $xml.SelectNodes("//*")
  $folderName = ""
  if($folderStructureNode -ne ""){
   $folderNode = $xml.SelectSingleNode("//*[translate(local-name(),'ABCDEFGHIJKLMNOPQRSTUVWXYZ','abcdefghijklmnopqrstuvwxyz')='$folderStructureNode']")
   if($folderNode -ne $null){ $folderName = $folderNode.InnerText }
  }
  if($folderName -eq "" -or $folderName -eq $null){ $folderName = $file.BaseName }
  $createFolder = 0
  if(!(test-path $filepath1$folderName -PathType Container)){ $createFolder = 1 }
  for($j=0;$j -lt $myNodes.Count;$j++){
   $b64 = $myNodes.Item($j) | select-object -ExpandProperty "#text" -ErrorAction SilentlyContinue
   if($b64.length -gt 2000 -and $b64.indexOf(" ") -eq -1){
    $b64name = $myNodes.Item($j) | select-object -ExpandProperty "name"
    $b64name = $b64name.Substring(3)
    $bytes = [Convert]::FromBase64String($b64)
    if($bytes.length -gt 0){
     $arrFileNameBytes = @()
     <#When the attachment is decoded into bytes, byte 20 tells you how many characters are in the filename. Multiply by 2 because the filename is stored as UTF-16 (two bytes per character)#>
     $fileNameByteLen = $bytes[20]*2
     <#Recreate the filename using every other byte of the filename section, skipping the trailing null terminator#>
     for($i=0;$i -lt $fileNameByteLen-2;$i+=2){ $arrFileNameBytes += $bytes[24+$i] }
     <#Determine content length by Total - Header - Filename#>
     $fileContentByteLen = $bytes.length-(24+$fileNameByteLen)
     <#Create a new array by cloning the content bytes#>
     $arrFileContentBytes = $bytes[(24+$fileNameByteLen)..($fileContentByteLen+24+$fileNameByteLen-1)]
     $fileName = [System.Text.Encoding]::ASCII.GetString($arrFileNameBytes)
     $fileName = $fileName.trim()
     $fileName = $fileName -replace '[^\p{L}\p{Nd}()_.-]',''
     <#Some attachments (often PDFs) have no extension in the stored filename; default to .pdf#>
     if($fileName.length -lt 5){ if($fileName.indexOf(".") -eq -1){ $fileName = "$fileName.pdf" } }
     elseif($fileName.indexOf(".",$fileName.length - 5) -eq -1){ $fileName = "$fileName.pdf" }
     $fileName = $b64name+" - "+$fileName
     if($createFolder -eq 1) {
      New-Item -ItemType Directory -Force -Path $filepath1$folderName | out-null
      $createFolder = 0
     }
     <#If a file with this name already exists, append (1), (2), ... before the extension#>
     if(test-path "$filepath1$folderName\$fileName"){
      $myLoop = 1
      $fileNamePre = $fileName.substring(0,$fileName.length-5+($fileName.substring($fileName.length-5).indexOf(".")))
      $fileNamePost = $fileName.substring($fileNamePre.length)
      while(test-path "$filepath1$folderName\$fileName"){
       $fileName = $fileNamePre+"("+$myLoop+")"+$fileNamePost
       $myLoop++
      }
     }
     <#Write the content bytes back to disk as the reconstructed attachment#>
     [System.IO.File]::WriteAllBytes("$filepath1$folderName\$fileName",$arrFileContentBytes)
    }
   }
  }
  $loopCounter++
 }
 write-progress -id 1 -activity "Step 2 of 2: Extracting Attachments" -status "Completed" -Completed
}
GetFilesAndAttachmentsFromLibrary

JavaScript version (for all you regular people out there ^_^; skip to the bottom to see an explanation of how this all came together). This one only needs two things: the URL to the form library and a comma-separated string of filenames for the files you want to access (e.g. filename1.xml,filename7.xml,myForm2017-11-30T11_01_01.xml). You can get that list of names by creating a view for the library that shows just the "Name (for use in forms)" column and exporting that to Excel. I usually just make sure all of them end up in cells next to each other and save as a CSV. Here's the script:
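If you end up with the names in a plain text column rather than a single comma-separated cell, a tiny helper can build the string the script prompts for. This is just a sketch; `toCommaList` is a hypothetical name and not part of the script below:

```javascript
// Hypothetical helper: turn a pasted column of filenames (one per line)
// into the comma-separated string the prompt expects.
function toCommaList(pastedColumn) {
    return pastedColumn
        .split(/\r?\n/)                                  // one filename per line
        .map(function (s) { return s.trim(); })          // drop stray whitespace
        .filter(function (s) { return s.length > 0; })   // skip blank lines
        .join(",");                                      // filename1.xml,filename7.xml,...
}
```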

/*
** Credit: Base64 to array from
** Credit: UTF8 Array to Str from (Masanao Izumo is the author)
** Created by: Brendan Horner 11/29/2017
** You can use this in the console if you load up any SharePoint page - possibly even throw this on an empty page in SharePoint.
*/
//These initial functions load a few utility functions for converting a base64 string of numbers into a file
function base64ToArray(base64) { var binary_string = window.atob(base64); var len = binary_string.length; var bytes = new Uint8Array( len ); for (var i = 0; i < len; i++){ bytes[i] = binary_string.charCodeAt(i); } return bytes; }
function Utf8ArrayToStr(array) { var out; var i; var len; var c; var char2; var char3; out = ""; len = array.length; i = 0; while (i < len) { c = array[i++]; switch (c >> 4) {case 0: case 1: case 2: case 3: case 4: case 5: case 6: case 7: out += String.fromCharCode(c); break; case 12: case 13: char2 = array[i++]; out += String.fromCharCode(((c & 0x1F) << 6) | (char2 & 0x3F)); break; case 14: char2 = array[i++]; char3 = array[i++]; out += String.fromCharCode(((c & 0x0F) << 12) | ((char2 & 0x3F) << 6) | ((char3 & 0x3F) << 0)); break;}} return out;}
function isBase64(str) { try { return btoa(atob(str)) === str;} catch (err) { return false;}}

//This is the function for when the retrieval was successful of the XML file
function successHandler(data) {
    var myID = Math.floor((Math.random() * 100000000) + 1);
    var newHTML = "";
    jQuery(data).children().find("*").each(function () {
        //This loops through every element in the XML file; the next line checks whether it's a giant string of text encoded in Base64 (used for files)
        if (jQuery(this).text().length > 5000 && isBase64(jQuery(this).text())) {
            //Decode the Base64 string so we can extract certain pieces. The resulting array is a 24-byte header + filename bytes + file content bytes
            var bytes = base64ToArray(jQuery(this).text());
            var arrFileNameBytes = [];
            //Byte 20 of the InfoPath attachment header says how long the filename is; multiply by 2 for UTF-16 encoding (an extra byte between each letter)
            var fileNameByteLen = bytes[20] * 2;
            var i = 0;
            //Grab every other byte of the filename section (the actual usable characters) except the last pair, which is always a \0 terminator
            while (i < fileNameByteLen - 2) { arrFileNameBytes.push(bytes[24 + i]); i += 2; }
            //Now that we have an array of bytes for the filename, convert the byte codes back to actual letters/numbers/symbols
            var fileName = Utf8ArrayToStr(arrFileNameBytes);
            //In case there are stray spaces at the beginning or end of the filename, remove them
            fileName = fileName.trim();
            //Prefix the filename with the node name (minus its namespace prefix) so you can tell where it came from
            fileName = this.nodeName.substr(3) + "-" + fileName;
            //Some PDFs in our implementation don't have a file extension in the stored filename, so if there is no extension, add .pdf
            if (fileName.slice(-5).indexOf(".") === -1) { fileName += ".pdf"; }
            var fileContentByteLenStart = 24 + fileNameByteLen;
            //Create another array with JUST the content of the file by starting 24 bytes (header) + filename length into the byte array
            var arrFileContentBytes = bytes.slice(fileContentByteLenStart);
            //Convert the byte array into a Blob (Binary Large OBject) - basically this reconstructs the file
            var myBlob = new Blob([arrFileContentBytes]);
            var myBlobUrl = URL.createObjectURL(myBlob);
            //This next section just creates a link on the webpage to download the file
            var myLink = "<a href='" + myBlobUrl + "' download='" + fileName + "'>Click here to download " + fileName + " | </a>";
            newHTML += myLink;
        }
    });
    //If there were any attachments, append download links to the page where you ran this script
    if (newHTML.length > 1) {
        newHTML = "<div id='" + myID + "' style='display:block;clear:both;border: 1px solid black'>" + newHTML + "</div>";
        jQuery("body").append(newHTML);
    }
}

function errorHandler(data) {
    console.log("Request failed: " + data);
}
function myMain() {
    "use strict";
    //Ask for the URL to the library and a string of filenames to process (so you can limit how many you do at once), then split that list for processing
    var libraryUrl = prompt("What is the url to the library in question (DO NOT INCLUDE TRAILING SLASH)?");
    var fileNames = prompt("Please give a list of filenames to process (including extension like .xml for infopath) separated only by commas");
    var arrFileNames = fileNames.split(",");
    var i = arrFileNames.length;
    //Main loop: attempt to retrieve each InfoPath XML file and scan it if retrieval succeeded
    while (i--) {
        jQuery.ajax({
            url: libraryUrl + "/" + arrFileNames[i] + "?Noredirect=true",
            type: "GET"
        }).done(Function.createDelegate(this, successHandler)).error(Function.createDelegate(this, errorHandler));
    }
}
//Loads jQuery if needed and then runs the myMain function
"use strict";
if (typeof jQuery === "undefined") { (function (e, s) { e.src = s; e.onload = function () { jQuery.noConflict(); myMain(); }; document.head.appendChild(e); })(document.createElement("script"), "//"); } else { myMain(); }

Further Explanation: So here's how it pulls together in either script:

  1. Attachments inside InfoPath XML files are stored as a long string that has been "encoded" into something called Base64. You don't have to know what that means, you just have to know they were encoded - which means to use them we need to decode them. That's the first step once we have the InfoPath XML file.
  2. The attachment files (once decoded) are composed of a header portion and the data portion. The header consists of 2 parts: a fixed header set of information and the filename of the attachment.
  3. Byte 20 of the header tells you how long the filename is. This is a single byte but needs to be doubled because the filename is stored in UTF-16 (two bytes per character).
  4. Once we know how long the filename is, we look *just* past the end of the regular, fixed header (byte 24) and copy every other byte till the end of the filename section - except for the very last pair, which is a null byte telling the system this is the end of the file name. The reason we take every other byte is that the ones in between are the high-order bytes of the UTF-16 encoding, which are zero for ordinary characters. One weird thing - PDFs in my environment didn't have a file extension at the end of their filename, so I created a small workaround to just append .pdf if no extension was present.
  5. Now that we have the filename, we need to separate out the contents from the header, so we create a new array that just starts after the end of the filename.
  6. This new array needs to be converted back into a file. In PowerShell, we use the WriteAllBytes function: we tell it the path where it should write and hand it this blob of goodness; in JavaScript, the new array is converted into a Blob (binary large object).
  7. Getting access to the file itself is easy in PowerShell - when the script is done, go look at your folder; in JavaScript, the script creates a block at the bottom of your SharePoint page for each file that had attachments and puts the download links there, along with the name of the node from which each attachment came.
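The numbered steps above can be condensed into a standalone decode routine. This is a minimal Node.js sketch, not either of the scripts themselves: it assumes the layout described above (24-byte fixed header, byte 20 holding the filename length in characters, UTF-16 filename section ending in a null, then the raw content), and the `encodeInfoPathAttachment` helper exists only to build a fake attachment for demonstration:

```javascript
// Sketch of the decode steps above, assuming the header layout described
// in this post. Returns the recovered filename and content bytes.
function decodeInfoPathAttachment(base64) {
    const bytes = Buffer.from(base64, "base64");      // step 1: Base64 decode
    const nameUnits = bytes[20];                      // step 3: byte 20 = filename length in chars
    const nameByteLen = nameUnits * 2;                // UTF-16: two bytes per character
    // step 4: filename is UTF-16LE starting at byte 24; drop the trailing null pair
    let fileName = bytes.slice(24, 24 + nameByteLen - 2).toString("utf16le").trim();
    if (fileName.slice(-5).indexOf(".") === -1) { fileName += ".pdf"; } // extension fallback
    // step 5: content starts right after the header + filename section
    const content = bytes.slice(24 + nameByteLen);
    return { fileName: fileName, content: content };
}

// Hypothetical helper (not part of the InfoPath format itself): builds a
// fake attachment blob with the assumed layout so the decoder can be tested.
function encodeInfoPathAttachment(fileName, content) {
    const header = Buffer.alloc(24);                  // fixed header, zeroed for the sketch
    header[20] = fileName.length + 1;                 // filename length incl. null terminator
    const nameBytes = Buffer.from(fileName + "\0", "utf16le");
    return Buffer.concat([header, nameBytes, Buffer.from(content)]).toString("base64");
}
```

Round-tripping a fake attachment through both functions is an easy way to sanity-check the offsets before pointing either script at real data.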