PDF Remediation – Where do I start?

Tips and best practices for remediating a PDF document.

Video transcript

[Dan Tuleta] All right, well welcome everyone to another edition of Equidox Webinar Wednesdays. It's just about two o'clock, so I think we should get started. Now for anyone that has joined us before, we really appreciate you continuing to attend these webinars to learn a bit more about document remediation and Equidox. So today's presentation is going to be talking about refining and designing a workflow for attacking these documents from a remediation standpoint. So as always please feel free to reach out to us at any given time through our website, www.Equidox.co. We do encourage follow-up questions or feedback for these webinars, and if you'd like to see any personalized demonstrations of Equidox, maybe more tailored to your specific workflow, please feel free to reach out to us. We would love to start a conversation with you.
Now in terms of document remediation, if you are tasked with remediating documents… if this is your everyday job… or if this is just a one-off project, maybe one that you've been asked to complete as a content creator, there are a few things to consider in the pre-flight process. So before you get started with remediating a document, one of the first things that I like to look at is how many pages there are. If there are just one or a couple of pages in this document, you know, then you can kind of just power through it and approach it by just opening the document up, taking a look at what you have, and starting to work on it. Now if there are a lot of pages in this document, typically I like to take a few minutes to just investigate what the document is made up of. Because you might choose to be a little bit more strategic and tactical in how you approach that document. There could be a number of steps along the way that you could consolidate, which will eliminate redundancies and keep you from repeating steps page after page.
So we'll take a look at a few of those things later on during the demonstration.
Now another thing to consider is how complex the content is. A document might have a lot of pages, but consider a document that's just all text, for example, where you just have a series of paragraphs page after page after page… The great thing about Equidox is its ability to auto-detect that type of content. And it's going to do a lot of the heavy lifting for you. So the complexity of the content does have a lot to do with your workflow as you remediate. A 100-page document made up entirely of text elements could be easier than a 10-page document, even though it's much longer. A 10-page document might have very technical engineering drawings, or very complex elements that take more actual work to tag properly. So something to look at is the complexity of the content, to kind of build out the workflow in your head before you get started.
Another thing that I look at is whether the design and the formatting of the document are consistent throughout. So is this document coming from a template? Is there a consistent use of heading structure throughout? Are the lists properly formatted throughout the document? How consistent is this page after page? Sometimes I see documents which are almost like a Frankenstein document, where it's a series of pages that have come together from various locations. There's nothing consistent about them. Every page is unique, and every page has to be treated as its own individual puzzle. So it's just important to note if there's consistent formatting throughout the document. There are ways within Equidox to consolidate steps and to eliminate some redundancies so that you don't have to repeat the same exact steps page after page.
Now another thing that I always look at is whether there is existing tag structure associated with the document. And existing tag structure could be good or bad.
Now many documents, depending on where they come from… for example, Microsoft Word being a very common one… when you hit “Save as PDF” coming out of Microsoft Word, oftentimes there is an automated tagging process where Microsoft Word will attempt to convert that Word file into a PDF and generate some automated tags throughout that process. Now that can be good or bad, depending on the way that the document was put together. But the point is that you don't really know what you're going to get during that conversion process from Microsoft Word into PDF. Because how those tags are generated is basically a mystery to the user. You just have to kind of wait and see what you get. There are of course many things that you can take into consideration when you're designing your documents to improve their accessibility when you do convert them into a PDF. But they still need to be looked at, because PDF and Word are different formats and they just have different ways of marking up documents from an accessibility standpoint. So just because your Microsoft Word document is technically accessible, it does not mean that the PDF that it generates is going to be automatically accessible. So you still need to take into consideration that the existing tags that were generated automatically through Microsoft Word might still need to be touched up or adjusted.
Now there are other documents that may have been remediated at some point in their life cycle, where someone might have opened the document in Adobe Acrobat and pressed auto tag and just kind of crossed their fingers and hoped that they got some sort of usable tag structure. This can be something that is going to cause you more problems than be of service. So these are just things to look at if you have existing tag structure. What is the quality of those existing tags, and are they worth maintaining, or is it better to just use Equidox and start from scratch?
Because of the automation and all of the tools that Equidox gives you to tag documents, there are many instances where it can be easier to just start over rather than trying to rescue poor existing tags.
Another aspect that I like to look at is whether the document requires OCR. OCR stands for Optical Character Recognition. This can be relevant for documents that are scanned, for example, or documents that might contain infographics that have text that is actually locked inside of images. Now OCR certainly adds another layer of complexity to remediation, because an image is simply an image. In order to make these documents compatible with screen readers and machine-readable, we need to extract text from the image if there is in fact text inside of it. So if you have a document where someone took an old newspaper and ran it through a scanner, technically that document can come out of that scanner as a PDF. But it is simply just an image as far as a machine reader is concerned. So this requires an extra step called Optical Character Recognition, where you're actually analyzing the image and extracting the usable, searchable, selectable text from it. So you have to run this OCR process to extract that text, which can take a little bit of extra time, but this is just sort of the nature of the format. If you're dealing with scanned images, then you have to take the time to make sure that you are extracting the content from them in order to make it machine-readable.
Another thing to consider is whether there are form fields present. PDF documents that are intended to be fillable forms oftentimes will have form fields present within them. And this is another element that does take a little bit of extra care, because form fields need to have what are called Tooltips applied to them.
So if you find that there are form fields in the document, that's another thing to be aware of and another thing to sort of budget your time and your workflow around… making sure that you take the time to properly handle those form fields.
And then one of the last things that I like to look at is the images. So if there are images present in the document, which many times there will be, how many images are there? About how many images per page are there? Are these images informative, are they technical, or are they just decorative? Now decorative images could be things like repetitive logos… they could be watermarks… or background images… they could be stock images. So in some sort of brochure where you just have two people bending over a computer smiling at the computer screen… these are the sorts of stock images that are not actually adding any content or contextual value to the document itself. They're simply there for the aesthetics and to take up some space and maybe balance out the page. So not all images require alt text, but that's up to the remediator to determine… If there are images, what kind of alt text and what kind of effort do you have to make to apply that alt text? Perhaps the document is full of technical images. Maybe this is engineering content where you have a bunch of complex diagrams and graphs and charts. If you're just the remediator and you're not the subject matter expert, you might need to bring in someone who knows more about that subject than you to help you with the alt text writing. So these are just things to consider as you are going through your remediation and as you're developing your workflow throughout that document.
Now some collaboration ideas! Because Equidox is a web-based application, there's a collaboration aspect built into the tool. So we can actually work together and share documents through the Equidox platform.
And because we work on a concurrent user licensing model, that means you can have multiple people logged into the same account simultaneously, and you can even have multiple people working together on the same document simultaneously. So Equidox gives you the ability to share documents with other users or groups, and you can actually divide and conquer. So let's say you have a 200-page document and it's got a very tight deadline. You know, you'd like to get it posted by the end of the day. Well, rather than one person having to hunch over their computer all day and kind of panic about this very large dense document, you might have five different users and everyone takes 40 pages. And then you can kind of power through that document in an hour or two, rather than having one person stress out and clear their calendar to make sure that they can get through it. So this is all made possible because this is a cloud-based application. Being able to share the document and even have multiple users working together simultaneously has a lot of advantages built into it.
As I mentioned before, images could require subject matter experts, so depending on your background or the nature of the document, the images that are contained in the document might be so complex that you're not exactly sure how to provide an accurate alt description. So again, the advantage of being web-based is that you could send the URL for the page over through an instant messenger or through an email to one of your colleagues, or one of your teammates who might be more familiar with the content, and just ask them to take a few minutes to provide alt text for those images. And then you can do the rest of the tagging of the document.
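The divide-and-conquer split described above (200 pages, five users, 40 pages each) is simple arithmetic. A minimal Python sketch of that planning step follows. It is a planning aid only and does not call Equidox; the function name `split_pages` and the user names are hypothetical.

```python
def split_pages(total_pages: int, users: list[str]) -> dict[str, range]:
    """Assign each user a contiguous, near-equal block of 1-based page numbers."""
    if total_pages < 0 or not users:
        raise ValueError("need a non-negative page count and at least one user")
    # Every user gets `base` pages; the first `extra` users get one more.
    base, extra = divmod(total_pages, len(users))
    assignments: dict[str, range] = {}
    start = 1
    for index, user in enumerate(users):
        count = base + (1 if index < extra else 0)
        assignments[user] = range(start, start + count)
        start += count
    return assignments


# Hypothetical team of five sharing the 200-page example from the talk.
team = ["user1", "user2", "user3", "user4", "user5"]
plan = split_pages(200, team)
for user, pages in plan.items():
    if pages:  # skip users who received no pages
        print(f"{user}: pages {pages.start}-{pages.stop - 1} ({len(pages)} pages)")
```

Contiguous blocks are used here so that each person works through neighboring pages, which tend to share a layout.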
So from a structural standpoint, you can set your reading order, and your headings, and your lists and your tables, but for those images that require alternate text, you might not be comfortable trying to provide that… if they are very technical or very complex images… so you can get someone else involved in that workflow. Because this is cloud-based, they can be working on that alt text while you are working on the rest of the document.
And also in the Equidox application, there is a validation tool and a Page Notes feature. So assuming that you're working together with a large team on a specific document, you can use the validation tool to essentially mark pages as complete. That way you can avoid redundancies… so someone isn't coming into a page and making adjustments and changing things that have already been marked as validated. And then also the Page Notes feature is a way of just leaving yourself or other users a note. Perhaps you skipped over an element because you weren't sure how to approach it, or you'd like to just have some sort of communication with your colleagues as to what happened on that specific page, or how you decided to approach it, so that they might be able to sort of replicate that process on other similar pages. And Equidox will give you a consolidated list of all of these various page notes that you take throughout the document. And if you're an Administrator or a power user, and you have less skilled, less trained workers working underneath you, you can very quickly look through the page notes and make sure that everything has been addressed. If there were questions, or if there were issues, you can always just follow up in the application and look at that specific page and make sure that that issue was addressed. And we'll go through that as well in the demonstration.
Now eliminating repetition.
So, we have a lot of different tools built into Equidox that allow you to programmatically apply things to all pages. If you are an existing Equidox user, or perhaps you've seen a demonstration before, you're probably familiar with our Zone Detection slider. So this is a tool that will redefine or detect the reading zones within a document. And if you find that you have pretty consistent formatting throughout the entire document, you can apply this to all pages. Now the same thing can be done for the reading order. So if, for example, every page in the document has a two-column reading order, you can choose two columns as that reading order and hit “Apply to all pages,” and then Equidox will programmatically apply that two-column layout to every one of those pages.
We also have the ability to set up a heading template. So again, assuming that the document is consistently formatted and designed throughout, you can actually identify your heading hierarchy. So for example, if you find that your heading level twos are all in the same font style, with the same font characteristics, you can hit a simple checkbox and Equidox will programmatically look forward in the document, identify all of the different areas of the document where you have that same font style, and automatically apply that heading level two for you. And that can be done for all of your various heading levels.
We also have an “Ignore” feature. The Ignore feature is the ability to bulk artifact repetitive elements in a document. So if you have a footer or a header (maybe it's a revision date or a serial number, something that is just continuously repeating on every single page) you don't necessarily need to tag that on every page. And you can use the Ignore feature, if it's located in the same location on each page, to bulk artifact all of that information throughout the entire document. So that you're not feeding this redundant information to the screen reader user page after page after page.
Now another important part of this is to keep a rhythm. So, I find when I’m working on longer documents particularly, that remediation can be a very rhythmic process. If you have similar layouts and designs throughout the document, a lot of those pages will require the same techniques. And once you get through a handful of pages, you will start to eliminate redundancies and consolidate steps, so that you can start literally eliminating clicks. And if you're doing a high-volume type of project where you have a lot of pages to get through, if you can eliminate a handful of clicks or a handful of steps on every page, that really adds up to a lot of time savings at the end of that document, or the end of that project. So, as you get into a rhythm, I find it's very helpful to try to stay within that rhythm. And then there are the outlier type of pages. Perhaps you come across a page where you have a completely different layout, or there's a large infographic or a flow chart or something that is completely different than the pages that you've been previously working on. You can always skip over that page and just return to it at the very end of the process. So rather than disrupting that flow that you've just developed for the previous 10 or 20 pages, you can just skip over it and go back later to that flow chart or that infographic that requires a lot of extra work. And then you can kind of finish off the document by addressing that outlier type of page at the very end of the workflow.
Now another part of Equidox is validating as you go. And this is something that we will also cover in the demonstration when we get to it. So the HTML preview, if you've used Equidox before I’m sure you're familiar with this, but the HTML preview is a way of checking the structure that you've built on the page in terms of reading order, your headings, lists, tables.
All of that information that you have structured in that remediation process will be previewable in the HTML preview page. And this is a way of just checking your work before moving on to a subsequent page. So, you'll get a lot of feedback from the application as you're going through at the page level. And if you're struggling with a certain page, you can always revisit it and finish it off at the very end. So, it's just a way of checking your work as you go.
Also, when you are finished working through a document and you export the PDF, Equidox will run a series of accessibility checks on that export. Now the bullet points here on the slide… this is not a comprehensive list of everything that it is checking for, but just a few examples of the things that it's looking for are missing alt text for images… illogical reading orders or illogical heading structures… invalid merges, for example, if you are trying to merge a list with a text zone… And then the errors that Equidox finds throughout this accessibility check are directly linked to the specific page where they were found. It gives you a clickable link where you can just select the link and go straight to the page where that error was found. And then you can make your correction in Equidox and just re-export the document once you've worked through all of those errors.
So this accessibility check that we are applying on export, as I said, is not a fully comprehensive accessibility report… We still highly recommend using third-party validation tools and especially screen readers, to replicate exactly what a screen reader user is experiencing when they're working with this document. Now accessibility checking… it can be automated to a certain degree, but there are many, many different elements that it can miss because it's simply looking for technical requirements.
But it's not actually able to replicate what an end user is going through with a screen reader. So it's very important to make sure that you are checking your work as you export the document, putting it through various third-party tools, or using screen readers, which is the best and the ultimate check for the actual usability of that document.
Okay, so let's jump into Equidox now, and we'll work through a couple of examples here… and I realize that we’ve got about 10 minutes left. So I’m going to go pretty quickly through these documents. But that's sort of the point with Equidox, to be able to move very efficiently through them. So I have a couple of documents pre-loaded here into Equidox and I want to start with this one here. And this is just a one-page document. So as I said before, when you have a one-page document, sometimes it's best to just jump right in and see what you have rather than trying to spend a few minutes to be very strategic and tactical with it. You can just approach this page and just see what you've got in here, and start working through it.
Now what I notice right away is I have some images, I have some headings, I have text, I have a list here (in fact, a nested list), and a table down here at the very bottom. So just to quickly work through this, what I’m going to do is start at the top and just start artifacting images that I don't really need. I don't know what this is, just a background sort of watermark type of image… Here's another image of a black and white dog holding a briefcase… here's our Equidox logo up here. So I’m just typing in alt descriptions for them. Now I’m going to work on my heading structures. Everything right now is currently set as a text zone because this document was untagged to begin with. So I just start hitting my appropriate keyboard shortcuts to set my headings. Heading level 1, heading level 2, heading level 2, and heading level 2.
Now I’ve defined my headings for this page. Next up is this list. As you can tell this is a nested list, and what I’ll do is I will hit “L” on my keyboard and I will use my List Detection slider. So if I move this over, Equidox is going to look inside of this list and it's going to pick up these nested layers inside of the individual list items. So if you can tell, or if you've used Equidox before, you'll notice that this list has been automatically detected, and the nestings have been detected inside of those individual list items, simply by moving this List Detection slider. So this is using artificial intelligence and machine learning to better understand list structures, and something like this only takes a few seconds.
Now this down here is a table, and if you can tell I have a bunch of different text zones covering up this table because it has no existing tag structure. So I’m not going to worry too much about those existing zones. I’m just going to draw a single zone over the entire table, hit “T” on my keyboard and open up the Table Editor. Once I open up the Table Editor, I’m going to then use my detection tools here. So again, this uses computer vision and machine learning to redefine the table structure. So the individual rows and columns and cells have all been identified through our artificial intelligence. And then I can actually use the spanning feature to span across here for these various years. I’m just holding Shift and pressing “S” on my keyboard. And then this is an example of a table where I actually have two rows of column headers, so 2017 and 18, and then the various Q1s through Q4s… these are also going to be read as column headers. So, I will just change my column header setting from the standard one row to two. And then if I look at my HTML preview of this table, what I see is a nice clean accessible HTML table. And then the great part of Equidox is that it will take this table and it will automatically convert it into a PDF tag tree.
Now when I save this table and close out, all of those individual zones that were previously there have been artifacted. So I’m just left with the single Table zone. So, the last thing I need to do now is just reorder the page. Because I’ve removed some zones and I’ve drawn some zones manually, my reading order is a little bit out of sync. So, all I’m going to do is go to the Page Tab and press “Reorder.” And I can always check my HTML preview just to validate that I’m satisfied with everything that I see. This all looks pretty good to me, so what I’ll do is I will just save the page. I can mark it as validated if I'd like to, just to remind myself that I’ve already finished this. And if I go to the Output Tab, I’m then able to generate the PDF. Now when I generate the PDF, hopefully I’m not going to get any errors, and in fact I did not. But now the PDF opens up in a separate tab for me in my browser. If you can tell, nothing has visually changed about this document. It's exactly the same, but all of the tag structure that I’ve just set up during that remediation process is going to be automatically converted into those PDF tags for me.
Now with just a few minutes left, I want to take a look at a bit of a longer document. And this is a document here that's quite a bit older. It was created in 2007, and actually this document has an OCR page. So this kind of incorporates a few different elements into one place. When working with longer documents, of course I’ll take a look at the thumbnails for each of the pages and just try to get an understanding of what this document is actually made up of. But then one thing I like to look at is the Images Tab. In the Images Tab, I can see here there are quite a few images in this document. And they are somewhat technical, because these are screenshots and sort of instructional images. And some of the images already have alt text, whereas others do not.
So this is something to pay attention to as you're going through page by page, making sure that your images have all been described or in some cases artifacted if you do not feel that they need to be given a description. So this is something where I’m not really able to tell, based on the context of the document, what this image is trying to represent. So I don't feel comfortable providing alt text quite yet. I’ll wait till I get through the document and go through page by page.
Since I have OCR here on this very first page, I’m just going to use my Zone Detector. This Zone Detection tool is something that we have not yet demoed, but using that Zone Detection tool is definitely a huge part of the Equidox system. And you can redefine the granularity of the reading zones. I have this background image, so I’m going to just get rid of that background image, and I’m going to OCR all zones. So, I’m going to convert this scanned page. If you can tell, this is actually scanned, and I’m going to convert it now into selectable, searchable, usable text. And once I’ve done that, I can then set my heading level. So this is a heading level one, because it's on the very first page, in a nice big bold font. And you can check your HTML preview. So I’m pretty satisfied with that page, and I can then move on to the next.
Now if we had more time I would basically just be working through this document very quickly. A page like this, where I had an existing tag structure of one big P tag covering the entire page… this can be very quickly adjusted, setting my headings and setting my list down here. So if I just move my List Detector from left to right, that will define this list with all the individual elements.
I can check my HTML preview… perhaps I want to get rid of these headers and footers, for example, like the revision date… This is the type of information that can just be very redundant… So, using the Ignore feature as I mentioned before, if I just draw a zone that's intersecting with that zone and press “I” on my keyboard, that will become an Ignore zone. And then if I go back to the preview, the revision date is no longer part of the tagged elements. So if I go forward into the document and I look at other pages, the Ignore zone is already waiting for me. So it's already there, ready to remove that revision date.
Now this is a 14-page document, and working through it I would spend probably about 30 seconds per page on average to get through this. So it would be about a five- to ten-minute document to get through using some of the steps that I’ve already explained and described. You know, with the Ignore feature, setting up a heading template, and using the auto detection feature of the zones, you can very quickly get through content like this, because we've taken the time to be a little bit more strategic in how we approach it. So you don't have to remember to remove this on every page, and you don't have to set your headings on every page, because you can do it programmatically through the heading template. But this is what a more refined workflow can look like when you're working through longer documents. And even longer than 14 pages. Many times, you might have documents that are hundreds of pages, and so the time savings can really be multiplied over those various pages.
And then when you export a document like this (keeping in mind that I have not addressed every page), this is going to actually produce some error messages for me. So this is where I just wanted to show you that, for example, those images missing alt text, as well as some other elements that I have not yet touched, are going to populate in this error list for me.
And if I were actually remediating this document, I would be able to go to the list of errors or warnings, and I'd be able to click on these warning messages and go directly to the page where they're located, fix them and then regenerate the PDF.
Now I realize that we are just about out of time, and I want to be respectful of everyone else's time, so I’m actually going to just jump back to the slide deck here. And I will just say thank you again to everyone for joining today. As I said before, if you have any specific questions or feedback, we would love to chat with you and talk about how Equidox can fit into your PDF remediation workflow. We can do another more specific demonstration, working through some of your documents if you'd like, and talk about how to define and refine a workflow for yourself.
For more information about how Equidox Software Company can help you with PDF accessibility:
Email us at EquidoxSales@equidox.co
Or give us a call at 216-529-3030
Or visit our website at www.equidox.co
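As a recap of the pre-flight questions from the start of the session (page count, consistency, existing tags, OCR, form fields, images), the checklist can be sketched as a small Python helper. This is an illustration under stated assumptions, not part of Equidox: the class and function names are hypothetical, the answers are filled in by hand after skimming the PDF, the `long_doc_pages` cutoff is an arbitrary choice, and the image count in the example is made up.

```python
from dataclasses import dataclass


@dataclass
class PreflightAnswers:
    """Hand-entered answers to the pre-flight questions (hypothetical structure)."""
    pages: int
    consistent_layout: bool
    existing_tags: bool
    existing_tags_usable: bool
    needs_ocr: bool
    form_fields: bool
    informative_images: int
    images_need_expert: bool


def preflight_notes(a: PreflightAnswers, long_doc_pages: int = 5) -> list[str]:
    """Turn the answers into a to-do list. The page cutoff is an assumption."""
    notes = []
    if a.pages < long_doc_pages:
        notes.append("Short document: open it and start tagging.")
    else:
        notes.append("Long document: skim the pages before tagging.")
        if a.consistent_layout:
            notes.append("Consistent layout: apply zones, reading order, "
                         "headings, and ignore zones to all pages.")
    if a.existing_tags and not a.existing_tags_usable:
        notes.append("Poor existing tags: consider starting from scratch.")
    if a.needs_ocr:
        notes.append("Scanned pages or text in images: budget time for OCR.")
    if a.form_fields:
        notes.append("Form fields: budget time for tooltips.")
    if a.informative_images:
        notes.append(f"{a.informative_images} informative image(s): write alt text.")
        if a.images_need_expert:
            notes.append("Technical images: ask a subject matter expert for alt text.")
    return notes


# Loosely modeled on the 14-page demo document; the image count is illustrative.
example = PreflightAnswers(
    pages=14, consistent_layout=True, existing_tags=True,
    existing_tags_usable=False, needs_ocr=True, form_fields=False,
    informative_images=6, images_need_expert=True,
)
for note in preflight_notes(example):
    print("-", note)
```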

It can be tough to know where to start when addressing your inaccessible PDF documents. This webinar will cover techniques and workflows for evaluating and prioritizing elements of your documents, and best practices for using Equidox software.
