
PDF Association Technical Conference, June 18-19, 2013

PDF and Microsoft Sharepoint


Hurdles to Overcome

Neil Pitman, Aquaforest Limited

Version 1.120613

Objective

PDF as a Sharepoint First Class Citizen

Objectives
Sharepoint Overview
PDF Capture
PDF Search

Agenda

iFilters
Handling Image and Mixed Mode PDFs

PDF Metadata
Dictionary, XMP and Entity Extraction

Configuration
Sharepoint 2010, 2013

Summary

Microsoft Sharepoint Server - 125 million licenses sold
Sharepoint to be a natural target for PDF storage

What is Sharepoint?
On-Premise and Cloud-based Collaboration & Document Management Platform

Sharepoint Overview

Origin - 2001
Usage
Focus on MS Office Documents
Typically distributed capture

Sharepoint Editions (2010, 2013)

Sharepoint Overview

Foundation
Standard
Enterprise
Office 365 / Sharepoint Online

Ecosystem
Partner Products
Office / Sharepoint Marketplace

Sharepoint Architecture Overview

MS Web-based (IIS)
MS Office Integration
SQL Server Storage

List or library data in a site collection is stored in a SQL Server database table, which uses queries, indexes and locks to maintain overall performance, sharing, and accuracy.

Filtered views with column indexes (and other operations) create database queries that identify a subset of columns and rows and return this subset to your computer.

Thresholds and limits help throttle operations and balance resources for many simultaneous users.

Privileged developers can use object model overrides to temporarily increase thresholds and limits for custom applications.

Administrators can specify dedicated time windows for all users to do unlimited operations during off-peak hours.

Information workers can use appropriate views, styles, and page limits to speed up the display of data on the page.

Microsoft Technology Stack

Windows Server 2008/2012
Internet Information Server (IIS)
.Net Framework
SQL Server
MS Office

Options
Sharepoint UI
Acrobat XI
Load Tools
Custom Code
Workflow & Event Receivers

PDF Capture for Sharepoint

WebRequest request = WebRequest.Create(destUrl);
request.Credentials = CredentialCache.DefaultCredentials;
request.Method = "PUT";
byte[] buffer = new byte[1024];
using (Stream stream = request.GetRequestStream())
using (MemoryStream ms = new MemoryStream(fileBytes))
{
    for (int i = ms.Read(buffer, 0, buffer.Length); i > 0; i = ms.Read(buffer, 0, buffer.Length))
    {
        stream.Write(buffer, 0, i);
    }
}
WebResponse response = request.GetResponse();
response.Close();
Logging.Log("Upload successful");

Acrobat XI Sharepoint Integration

http://www.adobe.com/uk/products/acrobat/pdf-version-control-sharepoint-integration.html

PDF Search in Sharepoint Overview


iFilters scan documents for text and attributes primarily in support of Microsoft Search technologies.

iFilter Architecture

iFilter Configuration

Architecture
Code Sample
Suppliers
Issues

iFilter Explorer

PDF Search in Sharepoint : iFilters

iFilter Explorer

https://gist.github.com/jimschubert/1473904
StringBuilder Buffer = new StringBuilder();
string PDFFile = @"C:\dev\PDF Conference\s.pdf";
FilterCode f = new FilterCode();
f.GetTextFromDocument(PDFFile, ref Buffer);
Console.WriteLine(Buffer);

public void GetTextFromDocument(string Path, ref StringBuilder Buffer)
{
    IFilter filter = null;
    int hresult;
    IFilterReturnCodes rtn;

    // Initialize the return buffer to 64K.
    Buffer = new StringBuilder(64 * 1024);

    // Try to load the filter for the path given.
    hresult = LoadIFilter(Path, new IntPtr(0), ref filter);
    if (hresult == 0)
    {
        IFILTER_FLAGS uflags;

        // Init the filter provider.
        rtn = filter.Init(
            IFILTER_INIT.IFILTER_INIT_CANON_PARAGRAPHS |
            IFILTER_INIT.IFILTER_INIT_CANON_HYPHENS |
            IFILTER_INIT.IFILTER_INIT_CANON_SPACES |
            IFILTER_INIT.IFILTER_INIT_APPLY_INDEX_ATTRIBUTES |
            IFILTER_INIT.IFILTER_INIT_INDEXING_ONLY,
            0, new IntPtr(0), out uflags);
        if (rtn == IFilterReturnCodes.S_OK)
        {
            STAT_CHUNK statChunk;

Using iFilters directly in Code

[DllImport("query.dll", SetLastError = true, CharSet = CharSet.Unicode)]
static extern int LoadIFilter(
    string pwcsPath,
    [MarshalAs(UnmanagedType.IUnknown)] object pUnkOuter,
    ref IFilter ppIUnk);

iFilter Test
Test content: Body Text, Bookmark Text, Annotation, Dictionary Metadata, XMP Metadata, PDF Attachment, Image/OCR Text

Filters tested: Adobe iFilter, PDFLib iFilter, FoxIt iFilter, Microsoft Format Handler

iFilter Test Results

Annotations, Bookmarks, Dictionary Metadata, XMP Metadata, PDF Attachment

Classify:
Image-Only
Born-Digital
Part Image-Only, Part Born-Digital
Previously OCRed

Dealing with Image and Mixed-Mode PDFs

Objectives:
Ensure Full Searchability
Avoid Text to Image Processing
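
The four classes listed above can be expressed as a small decision rule. The sketch below is illustrative only: PageInfo, PdfKind and PdfClassifier are hypothetical names that do not come from the slides, and it assumes some PDF library has already reported, for each page, whether it has an extractable text layer and whether it is a full-page scanned image.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical per-page summary; a real implementation would fill this in
// from whatever PDF library is used to inspect each page.
public sealed class PageInfo
{
    public bool HasTextLayer { get; set; }      // page has extractable text
    public bool HasFullPageImage { get; set; }  // page content is a scanned image
}

public enum PdfKind { ImageOnly, BornDigital, Mixed, PreviouslyOcred }

public static class PdfClassifier
{
    public static PdfKind Classify(IReadOnlyList<PageInfo> pages)
    {
        if (pages == null || pages.Count == 0)
            throw new ArgumentException("At least one page is required.", nameof(pages));

        // Every page is a scan that already carries a text layer.
        if (pages.All(p => p.HasFullPageImage && p.HasTextLayer))
            return PdfKind.PreviouslyOcred;

        // No extractable text on any page: nothing to index until OCR is run.
        if (pages.All(p => !p.HasTextLayer))
            return PdfKind.ImageOnly;

        // Extractable text on every page: OCR would add nothing.
        if (pages.All(p => p.HasTextLayer))
            return PdfKind.BornDigital;

        // Some pages have text and some do not.
        return PdfKind.Mixed;
    }

    // Only documents with at least one text-free page are sent to OCR, so
    // pages that already have text are never rasterized.
    public static bool NeedsOcr(PdfKind kind)
    {
        return kind == PdfKind.ImageOnly || kind == PdfKind.Mixed;
    }
}
```

For example, a two-page document whose first page has a text layer and whose second page is a bare scan is classified as Mixed, and NeedsOcr returns true for it. In that case only the second page would need OCR.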

Process:

Dealing with Image and Mixed-Mode PDFs

Capture Time?
Scheduled In-Place?

Text Search vs Metadata Search
Crawled vs Managed Properties
Review Requirements


Dictionary Metadata
XMP Metadata
Entity Extraction
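
As a rough illustration of the XMP route, the sketch below pulls the dc:title value out of a PDF's XMP packet so that it could later be copied into a library column. It is not taken from the slides. XmpReader is a hypothetical helper, and it assumes that the metadata stream is stored uncompressed and unencrypted and that the packet is wrapped in an x:xmpmeta element. Files that do not meet those assumptions need a proper PDF library.

```csharp
using System;
using System.Linq;
using System.Text;
using System.Xml.Linq;

// Hypothetical helper: locates the XMP packet in raw PDF bytes by scanning
// for the x:xmpmeta element, then reads Dublin Core properties from it.
public static class XmpReader
{
    static readonly XNamespace Dc = "http://purl.org/dc/elements/1.1/";
    static readonly XNamespace Rdf = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";
    const string EndTag = "</x:xmpmeta>";

    // Returns the parsed XMP packet, or null when no packet is found.
    public static XDocument FindPacket(byte[] pdfBytes)
    {
        // ISO-8859-1 maps each byte to exactly one char, so the binary
        // parts of the file cannot break the string search.
        Encoding latin1 = Encoding.GetEncoding("ISO-8859-1");
        string raw = latin1.GetString(pdfBytes);

        int start = raw.IndexOf("<x:xmpmeta", StringComparison.Ordinal);
        if (start < 0) return null;
        int end = raw.IndexOf(EndTag, start, StringComparison.Ordinal);
        if (end < 0) return null;

        // Assumption: the packet is UTF-8, so re-decode the slice before parsing.
        byte[] slice = latin1.GetBytes(raw.Substring(start, end - start + EndTag.Length));
        return XDocument.Parse(Encoding.UTF8.GetString(slice));
    }

    // dc:title is a list of language alternatives; return the first entry.
    public static string GetTitle(XDocument xmp)
    {
        return xmp?.Descendants(Dc + "title")
                   .Descendants(Rdf + "li")
                   .Select(li => li.Value)
                   .FirstOrDefault();
    }
}
```

A caller would pass the result of File.ReadAllBytes for the PDF to FindPacket and then call GetTitle on the returned document. Both methods return null when the information is absent, leaving the caller to fall back to dictionary metadata or entity extraction.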

PDF Metadata In Sharepoint

Consider Automation

Crawled vs Managed Properties

PDF Metadata In Sharepoint

PDF Metadata In Sharepoint : Using Event Receivers

Event Receivers can enable Metadata assignment

Entity Extraction

PDF Metadata In Sharepoint

Configuration

Sharepoint 2010
Sharepoint 2013

Missing icon and iFilter

Sharepoint 2010 PDF Configuration

http://www.adobe.com/devnet-docs/acrobatetk/tools/AdminGuide/Acrobat_Reader_IFilter_configuration.pdf

Sharepoint 2010 PDF Configuration

Default for PDF: 'X-Download-Options: noopen' added to HTTP Response Header

Sharepoint PDF Configuration

PDF Format Handler Support
Currently no iFilter Support for PDF!?

Sharepoint 2013 and PDF Configuration

Inline Viewing PDF in Sharepoint 2013

Sharepoint 2013 and PDF Configuration

http://stevemannspath.blogspot.co.uk/2012/10/sharepoint-2013-pdf-preview-in-search.html
http://stevemannspath.blogspot.co.uk/2013/04/sharepoint-2013-pdf-support-and.html

Microsoft Sharepoint Server - 125 million licenses sold
Sharepoint to be a natural target for PDF storage
PDF as a Sharepoint First Class Citizen

Summary

Contact : neil.pitman@aquaforest.com
