Where is a checked out file saved in SharePoint before it is checked back in?
Consider the following:
- ‘A-Document.doc’ already exists in SharePoint.
- Sally checks out ‘A-Document.doc’, makes changes, saves the file, and does not check the file back in.
- Scott opens the file and sees it as it was prior to Sally’s edits.
So, I know what you are thinking… “SharePoint is working as designed.”
Great, but do you know the answer to the question? Where are Sally’s edits saved?
Perhaps you said, “She’s probably using local drafts and her edits are likely saved in ‘My Documents’ in a folder called ‘SharePoint Drafts’”.
Nope. Wrong answer. (although a copy of the document before the edits was likely placed in that location when she checked the file out)
You see, Sally can log on to Scott’s computer (with her user id), open the file (remember, it’s still checked out to her) and see her edits. Clearly, the updated file is stored in SharePoint and visible only to her.
Site collection administrators and even farm administrators cannot access Sally’s edits (still checked out to her).
I got it! If we want to access Sally’s edits (and she has not checked in the file) we can go to "Manage checked out files" in the library settings. Nice try… That only works on a file that has never been checked in (More info here: http://bit.ly/kKxU8l). Sweet.
So, knowing a little something about SharePoint, I’d be willing to bet this file is in the content database.
Diving into the details:
Disclaimer: Never work directly with the content database (as I do in this post). Even though I access it read only, this can still cause unexpected locking. Furthermore, I did this testing in a lab environment with no active users and a content database that will never need to “be supported” by Microsoft.
Ok, looking at a content database, I see 3 tables of interest:
AllDocs, AllDocVersions & AllDocStreams
In my test environment, I created a new content database and then created 1 site collection in that content database (using blank template). Next, I created 1 test library (added no content to the TEST library yet)
I started running queries against the content database.
“SELECT * FROM dbo.AllDocs” returns 108 records.
Why so many records with one blank site in this database? Well, the AllItems.aspx in the library I just created counts as one record. There are many artifacts needed to support a blank site.
“SELECT * FROM dbo.AllDocVersions” returns 0 records.
“SELECT * FROM dbo.AllDocStreams” returns 0 records.
OK, on with my testing…
I log on as User A and create a document called ‘A-Document.doc’ in my test library. I save the file and close it (the file is not checked out)
“SELECT * FROM dbo.AllDocs” now returns 109 records.
Think we have found the document…? Think again. We have actually found information about the document, but not the document.
Let’s look at the other tables before we jump to conclusions.
“SELECT * FROM dbo.AllDocVersions” still returns 0 records.
“SELECT * FROM dbo.AllDocStreams” returns 1 record. (Nothing as ‘obvious’ as above… a few id related columns with what appears to be guid’s, etc…)
I examined the column types in the AllDocStreams table, noticed that there is a column named ‘Content’ and the data type is ‘image’. (in my query, this looked like a string of random characters)
Well, this is exactly the type of column where BLOB data (documents) would get stored. Armed with additional information, I did some more targeted googling, I mean binging. I was able to confirm that this table stores the actual file contents.
The field highlighted above contains the document.
We will disregard the AllDocVersions table throughout the rest of this test as it is not relevant in this scenario.
I’ll share the results of one more query to “tie this together” before we proceed. We’ll call this “Big Ugly Query”, I’ll use it more later.
SELECT dbo.AllDocs.LeafName, dbo.AllDocStreams.Content FROM dbo.AllDocs INNER JOIN dbo.AllDocStreams ON dbo.AllDocs.ID = dbo.AllDocStreams.ID AND dbo.AllDocs.Level = dbo.AllDocStreams.Level where LeafName=’A-Document.doc’
The above query returns one row.
So, to be clear, in the above results ’LeafName’ and ‘Content’ are coming from different tables. I am just presenting the results like this for clarity. Also, it’s worth noting that the AllDocs table contains information about the document while the AllDocStreams table contains the actual document.
Now, let’s test the scenario described at the beginning of this post.
I logon to another machine as User B. As that user, I checkout the document, make some edits, save the document and do not check it back in. As we would expect, if User A opens the document they do not see User B’s edits. They see just the original document contents.
We now have two rows in the AllDocs table for this document. The 2nd row is the one associated with User B’s changes (but is not the actual document, remember?)
Ok, remember “Big Ugly Query”? Now, it returns two rows. I have highlighted the document containing User B’s edits.
Ok, I answered the question. It’s probably safe to say that we know exactly where that copy of the document is stored (before it is checked in).
I, however, require even more validation. So, I decided to learn how to use c# to read blob data from SQL and verify that I can gain access to User B’s changes (even though the document is checked out and I am not User B).
Proof of concept
How to get User B’s edits without logging in as User B.
Disclaimer: Never blindly use code you “found on the Internet”. This is especially true if the code accesses SharePoint databases. This would qualify as a CLM (career limiting maneuver). Even though I understand this code, I would not use it in my live SharePoint environments (remember, I performed this work in a lab setup that users do not access and that does not need to be supported by Microsoft)
Also, this code would barely qualify as alpha code. This could be made much more dynamic and robust. I just wanted to quickly validate the findings above.
Essentially, the code below performs a SQL query. For each row returned, it reads the ‘Content’ field (the file) into a byte array and dumps that out to a file on the local file system. Simple? Yes. Supported? No.
This compiles fine in VS2008.
using System; using System.IO; using System.Data; using System.Data.SqlClient; namespace tmpReadBlob { class Program { static void Main(string[] args) { string fileName = "A-Document.doc"; String myQuery = "SELECT dbo.AllDocStreams.Content FROM dbo.AllDocs INNER JOIN dbo.AllDocStreams ON dbo.AllDocs.ID = dbo.AllDocStreams.ID AND dbo.AllDocs.Level = dbo.AllDocStreams.Level where LeafName='" + fileName + "'"; SqlConnection con = new SqlConnection("Server=SQLSERVER;Integrated security=true;database=CONTENTDB"); SqlDataAdapter da = new SqlDataAdapter(myQuery, con); SqlCommandBuilder MyCB = new SqlCommandBuilder(da); DataSet ds = new DataSet("AllDocStreams"); byte[] MyData = new byte[0]; da.Fill(ds, "AllDocStreams"); int i = 1; FileStream fs; int ArraySize = new int(); foreach (DataRow myRow2 in ds.Tables["AllDocStreams"].Rows) { MyData = (byte[])myRow2["Content"]; ArraySize = MyData.GetUpperBound(0); fs = new FileStream(@"c:\test" + i + ".doc", FileMode.OpenOrCreate, FileAccess.Write); fs.Write(MyData, 0, ArraySize); fs.Close(); i++; } Console.WriteLine("done."); Console.ReadKey(); } } }
The results are shown below:
Note: I opened both files from the file system. ‘test1.doc’ contained the original file and ‘test2.doc’ contained User B’s edits.
Q: Is there a supported way to get to get User B’s edits (without logging in as User B)?
A: I *think* this would be supported. Get a full backup of the content database and attach to a test server. Use a process similar to what I have described above to extract the file(s) from that test server. Browse to your SharePoint site using Internet Explorer and upload the files you extracted. (discard the copy you made of the content database on the test server). This seems like a supported approach to me, as the only time you interact with production is when you upload files from your local file system using the web browser (supported operation).
Final thoughts:
Really? Even as a SharePoint 2007 farm administrator I can’t access data stored in SharePoint because it is “checked out”? Also, "Manage checked out files" is really this weak?
I believe that "Manage checked out files" should let owners and/or administrators access User B’s saved content that is not checked in. (Take ownership?)
Now, to be fair, I did this testing on SharePoint 2007. I decided to explore whether "Manage checked out files" was improved in SharePoint 2010.
Nope… It’s still a weak tool. It did, however, get renamed. It is now called “Manage files which have no checked in version” 😉 Wow, seems like allowing someone to access or take ownership of saved content that is not checked in would have required minimal effort.
Anyway, this is not really a huge issue and I can imagine one’s of people out there really desiring this functionality. It’s just odd when, as a systems administrator, you do not have the access or visibility into a system to clearly articulate where data is stored (in all situations) and how to access it if/when needed.
If you’ve made it through this post, you’re probably a lot like me. You like to clearly understand details of how the system works. Also, I added two new categories to my blog as of this this post: pathfinding & minutiae 😉
Thanks!
Jeremy
@jjclore