Dan Tuleta hosts special guest David Freelan, an Equidox data scientist, to talk about Batch Processing. This is a fully automated PDF remediation process for large quantities of similar documents, like bank statements and recurring reports. David will discuss his artificial intelligence developments and how these can make bulk PDF remediation completely automated (and completely accessible!).
Slide Deck for Batch Processing Webinar
Equidox. By Onix. Reach Everyone
[Dan Tuleta] Okay so it’s just about two o’clock, so I think we’re ready to get started. For everyone that’s still shuffling in, all this is being recorded so you won’t miss too much in the very first minute or two. Thank you all for attending. Welcome again for those of you who have attended our webinars before. This is the next installment of Equidox Webinar Wednesdays.
Okay so welcome everyone to Equidox Webinar Wednesdays.
Talking about our newest service, batch processing. As always we do appreciate your attendance and if you have any questions that we do not answer during this presentation, please feel free to reach out to us…
Thank you again for joining us today, and if you have any questions throughout this presentation that we do not cover, please do not hesitate to reach out to us through our website at www.Equidox.com. You can call us directly at 800-664-9638, or the best way probably to reach out initially would be through our email, which is EquidoxSales@Onixnet.com. We’re also very active on LinkedIn and social media, so if you are on LinkedIn please feel free to connect with us. We post a lot of articles and information about our product, our services, and the accessibility space in general.
So just a couple of quick notes about us before we get into the nitty-gritty details of batch processing. For those of you who might only know us by Equidox, we actually have a parent company called Onix Networking, and Onix was founded in 1992 in Cleveland, Ohio. So we’re approaching nearly 30 years of business in the IT space. We’re best known for our partnerships with Google and AWS (Amazon Web Services). So we are really just an all-around cloud consultancy and cloud experts, with our mission being to improve organizational efficiency through cloud computing solutions. So if your organization, outside of accessibility, has any interest in having conversations about cloud software or cloud technology, please feel free to reach out to us as well. We can definitely direct you to the correct people within Onix.
Now Equidox is a branch of Onix focusing on accessibility. So many of you are probably aware of our best-in-class PDF remediation software called Equidox. We also offer PDF remediation services where clients will send us PDF documents that they need to have remediated. Our team of remediators and validators will make those documents accessible, and then we will return those documents to our clients, and then they can be posted online in an accessible format, obviously mitigating all legal risks by complying with accessibility laws and regulations. We also offer expert accessibility services, which can include testing, training, VPAT completion, auditing websites, and helping you put together accessibility plans for your organization. And our mission is to ensure that digital information reaches everyone through accessibility solutions.
So here are just a few customers who we support either through Onix’s cloud business or through Equidox, or possibly a combination of both. So we work with organizations of all sizes in every vertical, so if you have PDF documents and you need to make them… and you need to make them compliant, the Equidox team can definitely help you improve your workflow. So like I said, any sort of PDF challenges that you may have we are probably the people to talk to. So please get in touch with us either way, even if batch processing is not right up your alley. We do have Equidox software for ad hoc PDFs, and we also offer those services that I just mentioned.
Now, this week is a little bit different. So if you’ve joined us before for our Equidox Webinar Wednesdays, you’re probably used to me talking and demoing something about the Equidox software. But this time I’m actually co-presenting with my friend David Freelan. And he is on the Equidox team, and he is our lead data scientist. So I’m going to turn this over to David to introduce himself and talk a bit more about batch processing.
[David Freelan] Thank you, Dan. So I’m going to tell you a bit about myself. First, I started my journey at George Mason University. I got my graduate degree in AI and machine learning. I got recruited then by a robotic soccer team, and we represented the United States in a competition called RoboCup. I worked my way to team lead, and we competed in both Brazil and China. I went on to publish papers in deep learning and multi-agent systems. After my degree, I actually had a connection at the Cleveland Clinic that led me to Equidox, and I thought working for a company doing accessibility and trying to get everybody included sounded like a great way to apply my skills. So I found myself very motivated to contribute to Equidox’s goals.
So, wrapping things around: robotics has a lot of different components to it, but what’s most relevant for us today is computer vision techniques, which try to figure out, let’s say in the soccer example, where a ball or a robotic player is. All that visual analysis of what’s going on on a soccer field can actually be applied to analyzing a document in front of you.
So my initial contribution to machine learning at Equidox was our list and table editor. So when making those documents accessible, in a lot of situations our remediators have to specify where every single row and every single column in a table is, as well as highlight each item in a list. I can use the bullet points on this page as an example. A remediator would have to go through, tag this whole thing as a list, and tag each and every single item one by one. So all six items here, and they’d have to make sure they tag each and every single delimiter. So just for this tiny little list, we have a lot of different things we need to make sure are tagged. So we really want to automate as much of this as possible so that we save a lot of business time.
So the way we trained our machine learning models, we can actually take that manual human process that’s been done many many times by our remediators, and we can teach a machine learning model based on those hundreds of thousands of examples. We can then use that model to consistently detect these tables and lists and skip a lot of that manual labor. So while we use our internal documents to train these machine learning processes, when we apply ourselves to batch processing for a company, we’re going to use your documents to train. We don’t want a general, one-size-fits-all model.
We’re going to take your documents and set ourselves to identify the elements in there as accurately as possible. If you have a few documents out of thousands or even millions of documents that look a bit different, we want to make sure that we use every machine learning tool at our disposal to find those little outlier documents, which we’ll take a closer look at later. We want to give you a sneak peek on how that’s done so effectively. But first, let’s take another look at our goal state.
Here… (oh that’s the next slide) Yeah, here’s our goal state here. So we want to go ahead and identify each of these single zones. We have an icon on the top left here for your bank, we have a table with a header, and the table has headers inside it. We got all sorts of things going on. Now, this is just an internal representation of what’s going on on our end. Don’t worry, we’re not going to be modifying your document in any way. So before we… but before we train our models, we need to do a couple of internal steps first. We’re going to have to get… gather these documents and create all these templates, and I want to show you how we do that.
But first, let’s get just a few definitions out of the way, and then we can tackle a problem together. So artificial intelligence is just a general broad scope of algorithms that are used to mimic human-like intelligence on any particular task. More specifically, we can be interested in computer vision, so we’re trying to mimic say the visual cortex of a human and identify whether something is a cat or a dog or whether something’s a table or a list. Machine learning is a different subset of artificial intelligence which is about teaching a model instead of like, hard coding. You can teach a model over time, so where this intersection is, is where you’re teaching a computer to see. So that’s kind of our baseline vocabulary here.
And now that we have that out of the way, let’s go ahead and try to be data scientists together for a second, and take a look at an example company. So we can take out a machine learning algorithm, and we can teach it how to group similar documents together. So in this example on the right, each dot represents a single page of a document. And the closer those dots are, the more similar that page is.
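As a rough illustration of the grouping idea David describes, here is a minimal Python sketch. It is not Equidox's actual algorithm: the per-page features (chart area and table area as fractions of the page), the `threshold` value, and the function names are all assumptions made up for this example. It simply puts two pages in the same group whenever a chain of close-enough neighbours connects them.

```python
import math
from itertools import combinations


def group_similar_pages(features, threshold):
    """Group pages whose feature vectors lie within `threshold` of each other.

    Single-linkage grouping: two pages share a group if a chain of
    close-enough neighbours connects them. Returns lists of page indices.
    """
    parent = list(range(len(features)))

    def find(i):
        # Follow parent links to the group's representative page.
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i, j in combinations(range(len(features)), 2):
        if math.dist(features[i], features[j]) <= threshold:
            parent[find(j)] = find(i)  # merge the two groups

    groups = {}
    for i in range(len(features)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Hypothetical features per page: (chart area fraction, table area fraction).
pages = [
    (0.40, 0.05),  # pie-chart page
    (0.42, 0.04),  # pie-chart page
    (0.05, 0.60),  # table-heavy page
    (0.06, 0.62),  # table-heavy page
    (0.00, 0.00),  # plain letter
]
clusters = group_similar_pages(pages, threshold=0.1)
print(clusters)
```

In this toy data the two pie-chart pages end up together, the two table-heavy pages end up together, and the plain letter sits in a group of its own. A real system would likely use far richer features and a proper clustering library.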
So to show you a better example of what I mean, let’s zoom in on a cluster here in the middle. Let’s take a look at that. So in this cluster, we have a lot of pie charts. It would appear that at the top left, the pie chart appears near the bottom of the page, and as you go toward the bottom right of that graph, the pie chart kind of sneaks up toward the top. So just by looking at this cluster, zooming in, and clicking on three different documents, we already have a really good idea of how they vary. This gives our machine learning algorithm and our developers a way to see the variation in these documents, just by looking at three.
So we have the opportunity now to take a look at some of those points that are way on the outside… that are really far away from all the other points. We call these sort of the outlier points, and we see an interesting one here where the pie chart only has one category. And this could be something that a machine learning algorithm missed, or honestly, a human who is handwriting a template could totally miss out on this situation. If there’s thousands or millions of documents, this is just potentially one in a million that could be hard to find. So our machine learning algorithm was able to find that outlier, and handle that.
All right so there’s one other thing I actually want to note on this page. There’s a 100 there that’s right on top of the pie chart. That can actually kind of blend in with that paragraph above it. So again, without noticing these sorts of things, and going through this initial machine learning process, there’s a good chance we would miss that 100 percent. And missing something means that someone isn’t included. So we want to make sure we get everything.
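The "points that are really far away from all the other points" idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the method used in the product: each page is assumed to already be a 2-D point on the graph, and the scoring rule (mean distance to the `k` nearest neighbours, compared against `factor` times the median score) is a common heuristic chosen for this example.

```python
import math


def knn_outlier_scores(points, k=2):
    """Score each point by its mean distance to its k nearest neighbours."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(sum(dists[:k]) / k)
    return scores


def find_outliers(points, k=2, factor=3.0):
    """Return indices of points whose score exceeds `factor` times the median."""
    scores = knn_outlier_scores(points, k)
    median = sorted(scores)[len(scores) // 2]
    return [i for i, s in enumerate(scores) if s > factor * median]


# Hypothetical graph coordinates, one point per page.
points = [
    (1.0, 1.0), (1.1, 0.9), (0.9, 1.1), (1.0, 1.2), (1.2, 1.0),  # tight cluster
    (5.0, 5.0),  # a page far away from everything else
]
outliers = find_outliers(points)
print(outliers)
```

Pages flagged this way are the ones worth opening by hand, like the single-category pie chart in the example.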
Let’s take a look at this real fast. So this cluster we’ve zoomed into has a lot of letters in it. I wanted to show here that while we’re obviously doing a lot more than a fixed overlay, in this cluster there are some static elements, say the logo on the top and the address, that are all in generally the same location. We can actually take advantage of that and make sure that our algorithm focuses on getting these sorts of variations here. As we look at these three individual points, we can see that these paragraphs can vary in length. Some of this header can vary in size, so we need to make sure that we account for that. We sort of have a mix of static and dynamic things going on in this cluster.
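To make the static-versus-dynamic distinction concrete, here is a small Python sketch. Everything in it is assumed for illustration: the zone names, the `(x, y, width, height)` box format, and the `tolerance` of two units are not taken from the webinar. A zone is called static if its box barely moves or resizes across the pages in a cluster, and dynamic otherwise.

```python
def classify_zones(pages, tolerance=2.0):
    """Label each zone 'static' or 'dynamic' across the pages of one cluster.

    pages: list of dicts mapping a zone name to its (x, y, width, height) box.
    """
    result = {}
    zone_names = set().union(*(page.keys() for page in pages))
    for name in sorted(zone_names):
        boxes = [page[name] for page in pages if name in page]
        # Largest variation of any single coordinate across the pages.
        spread = max(max(column) - min(column) for column in zip(*boxes))
        result[name] = "static" if spread <= tolerance else "dynamic"
    return result


# Hypothetical zone boxes for three letters from the same cluster.
letters = [
    {"logo": (40, 30, 120, 60), "address": (40, 110, 200, 50), "body": (40, 200, 500, 300)},
    {"logo": (40, 30, 120, 60), "address": (41, 110, 200, 50), "body": (40, 200, 500, 420)},
    {"logo": (40, 31, 120, 60), "address": (40, 111, 200, 50), "body": (40, 200, 500, 180)},
]
zones = classify_zones(letters)
print(zones)
```

Here the logo and address come out static, while the body paragraph, whose height changes from letter to letter, comes out dynamic.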
Okay so in these three clusters, these clusters appear to be closer together because they have a sort of similar style, but each cluster has a different number of columns. So what I wanted to show you all here is how these documents, even though they have a differing number of columns, have a lot of similar features. So actually instead of building three different templates, one for each cluster, I could actually split up the document by its columns and create one template that can handle columns. And I could sort of combine all these clusters together. And graphing all these out gave me a good view of how these documents differ and how they are so similar to each other. So the machine learning algorithm really, really helped us learn a bit more about these documents.
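One way to picture "split up the document by columns and create one template that can handle columns" is the following Python sketch. It is a simplification built on assumptions: text blocks are given as hypothetical `(x_left, x_right, text)` tuples, and a new column starts wherever the horizontal gap to the previous blocks is at least `min_gap`. The source does not say how the real system finds columns.

```python
def split_into_columns(blocks, min_gap=30.0):
    """Group text blocks into columns based on horizontal gaps.

    blocks: list of (x_left, x_right, text) tuples.
    Returns a list of columns, each a list of blocks, ordered left to right.
    """
    columns = []
    current_right = None
    for block in sorted(blocks, key=lambda b: b[0]):
        x_left, x_right, _ = block
        if current_right is None or x_left - current_right >= min_gap:
            columns.append([block])  # wide enough gap: start a new column
            current_right = x_right
        else:
            columns[-1].append(block)
            current_right = max(current_right, x_right)
    return columns


# Hypothetical two-column and three-column pages from neighbouring clusters.
two_col = [(40, 280, "a"), (40, 270, "b"), (320, 560, "c")]
three_col = [(40, 200, "a"), (230, 390, "b"), (420, 580, "c")]
print(len(split_into_columns(two_col)), len(split_into_columns(three_col)))
```

Once a page is split this way, the same per-column tagging logic can be applied to each column, regardless of how many columns the page has.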
So the last kind of document I wanted to show you today is where the tables vary in their structure. So at the top left here, we have very few cells compared to, if we go all the way down on the graph, on the lower right, where we can see a lot of very dense cells. So our machine learning algorithm here is informing us that this is primarily how the document is going to change: the number of cells in this table. So we want to make sure that our machine learning algorithm not only identifies these PDF elements, but is able to tag them correctly despite these variations. So we have our own table algorithms to make sure that we handle this appropriately.
And again, let’s take a look at another outlier point because this could be a really important example. So in the top left, this could be a bug on this company’s end, or it could be something that’s an outlier that we need to make sure that we can handle. But there’s one on the top left that’s completely missing a table. There are no cells at all. So this is another thing that, if this cluster had a million documents in it, without something like this algorithm you might never find by looking for it by hand. So this is a case where we could either alert you that this customer might have an issue or, if you say it’s not a problem, we can just keep that in mind ourselves and make sure that we are handling that edge case appropriately.
So once we’ve done all of that we can build ourselves a lovely template. This is the same thing we saw on the last slide. Sort of a reminder of our goal state. We’ve now been able to show you how we can identify all these different elements on the page. And this is how internally (again we’re not marking your documents) all these things will be tagged. We will make sure we apply the tags correctly so that they are compliant, accessible and usable. Everything you’d want to see.
So now you might be wondering, “cool, we’ve made all these templates, we’ve done all this fancy machine learning stuff, how are we going to get these documents to your company?” You have documents, you want them remediated by our batch processing system… How’s that going to be, going to work for you?
So what we want to supply for your developers is an interface where you can call a REST API. And with that REST API… that REST API on our end is going to interact with either a local server or a cloud server that can be scaled up and down. Whether it’s local- or cloud-based is your preference.
So once we have that PDF, we can go along and forward that to the machine learning algorithm that we have just described in great detail. And after the machine learning has identified the templates, and tagged them all correctly, we can send that simply right back to you through the same REST API. And you have yourselves a tagged and accessible document for your end-users.
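For developers, the round trip described here could look something like the Python sketch below. The webinar does not give a URL, a path, a request format, or an authentication scheme, so every one of those is a placeholder: `https://batch.example.com/v1/remediate`, the bearer token, and the assumption that the response body is the tagged PDF are all hypothetical.

```python
import urllib.request

# Hypothetical endpoint: the webinar does not specify a URL or path.
REMEDIATE_URL = "https://batch.example.com/v1/remediate"


def build_remediation_request(pdf_bytes, api_key, url=REMEDIATE_URL):
    """Build (but do not send) a POST request carrying one untagged PDF."""
    return urllib.request.Request(
        url,
        data=pdf_bytes,
        method="POST",
        headers={
            "Content-Type": "application/pdf",
            # Assumed auth scheme, for illustration only.
            "Authorization": f"Bearer {api_key}",
        },
    )


def remediate(pdf_bytes, api_key, url=REMEDIATE_URL, timeout=60):
    """Send the PDF and return the response body (assumed to be the tagged PDF)."""
    request = build_remediation_request(pdf_bytes, api_key, url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()
```

A caller would read a statement from disk, pass its bytes to `remediate`, and write the returned bytes out as the accessible version. The same code would work against an on-prem server or a cloud one by changing only the `url` argument.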
Okay, so that concludes this. Dan, if you can take it away, we have a few FAQs.
[Dan Tuleta] Sure thank you David. So when we are talking about batch processing with our prospects and existing clients, there’s a lot of common questions that come up. So what we’ve tried to do within this presentation is to just compile a list of these FAQs. And we’re going to walk through a few of those common questions that we feel that many of you on this call are probably wondering right now. So David if you want to jump to the next slide and we can get started.
Okay so David, is this really accessible?
[David] Yes, yes it is. So it’s not just my expertise that’s going into this. I’m not a one-man show. We have our Director of Accessibility and his entire team, who are going to be going through all of the machine learning process here with our developers to make sure that we’re complying with everything. We don’t want to leave anybody out.
[Dan] Right. Our PDF remediation team is full of PDF experts. When we send documents back to clients as part of our PDF remediation services, everything goes through a multi-step validation process. So every document, every template that we’re designing for you, is going to go through rigorous validation and this is not just trying to trick the accessibility checkers into saying that this document is accessible.
We are actually building a fully tagged, fully accessible, fully usable document for your clients. Now again this is not an overlay so these are not static documents that have to be exactly the same way, and then you just place or copy and paste the exact same template onto every document. These documents might have minor variances, as David has explained. Even if that’s the case our machine learning algorithms are able to detect that and every document will be tagged in a unique way to make sure that it is fully usable.
So David, a lot of people are wondering about these templates. You know, how many templates can you have? What if they need to change? And if they do need to change how long do these changes take to apply?
[David] Yeah so the number of templates is very fluid. I wouldn’t necessarily be too concerned about the number, because templates can vary in difficulty. So that’s a thing we can do on a case-by-case basis. If they change, again it depends on how complex the changes are. I imagine a lot of changes can vary in difficulty.
So that’s again, unfortunately, one of those things we do tackle on a case-by-case basis. And the length of time is again, it’s one of those unsatisfying answers potentially. But yes we want to make sure we’re working with you, and the length of time it takes us to complete the projects is going to depend on your needs.
[Dan] Great. All right. That’s just one thing to keep in mind, is that all of these are custom solutions. So that’s going to be sort of a theme throughout the FAQs. We really do need to engage with you and your team on a one-to-one basis and understand your documents, your templates, and all of your needs to come up with a custom solution that will fit for all of your documents.
So David, can the system run multiple templates simultaneously?
[David] Yes that’s really not a problem. Running an arbitrary amount of templates is not an issue for us. We just need enough processing power to do it. And we don’t think it’s going to take that much processing power to do. So it’s really, really not a problem for us.
[Dan] Great, all right. And so David, of course, a very important question. Is this process fully automated? Or does it need any sort of human hand-holding or babysitting on a day-to-day basis?
[David] It is absolutely automated. It’s just a matter of setting up those initial conversations and creating those templates. And from there it should just go smoothly.
[Dan] Great. Once we’ve had those initial discovery calls, and we better understand your templates and your documents, and we can develop those templates, we can take the human involvement completely out of the equation and it all just works like magic.
So another question that we’re often asked is, should this be an on-prem or a cloud-based type of solution?
So in talking with our developers, both options are available. So we can set up this system either in the cloud or locally installed on your servers. Our developers have recommended for this type of system that on-prem is going to be faster because you won’t have data that has to ping back and forth from the cloud going both directions. So everything can stay local in your own environment.
And also this might help you meet some of your internal security requirements. So a lot of these documents might contain client data, or information about specific people, or account numbers, for example. So there might be a lot of very strict and rigorous internal IT security measures that you have to adhere to. So keeping it locally installed on your server in your environment might be a better option. But if you are interested in having a conversation about how to install this on the cloud, that’s absolutely fine. We can support either option.
Now David, what kind of volume and scalability are supported here?
[David] Yeah, it’s arbitrarily large. You can have a billion documents going through and we can handle it. You just need the hardware to do it. But we don’t exactly expect you to have to rent a supercomputer or anything to do that sort of thing. But it’s again, one of those case-by-case things. So we’ll work with you to determine exactly how big of a machine you might need.
[Dan] Right there can be a lot of variance between the size of the client. Whether you’re a small credit union, or a small utility company, or a Fortune 500 company. You could be talking about thousands of documents a month, or maybe millions of documents per month. But we can scale accordingly. It really is just dependent on the hardware that you are trying to run this through. But that again is a conversation that we can have offline once we’re having a one-to-one discussion with you and your team.
Now another question that we’re often asked is, is it secure?
And yes it is secure. Whether you go with the cloud option or the on-prem, it is in fact a very secure system. So many people might opt for the on-prem install just for that extra layer of security, keeping everything installed locally in your own environment to adhere to all of your internal IT security rules and regulations.
But just keep in mind that Equidox is a member of the GSA schedule, California’s CMAS, the Canadian SLSA, and we’ve also passed our ISO certification as well. So keep those in mind when you take everything away from this presentation. That this is in fact a secure solution, and we can have a more detailed discussion about any of your specific requirements when the time comes. I didn’t even notice you changed the slide, David, thank you.
So what does it cost?
This is of course the million-dollar question. So everyone wants to know what will this cost. And it really can vary depending on the complexity of the templates, the number of templates that you have, the volume, and the hardware that you are running it on. So there’s a number of factors that go into the price and we know that this has to scale up and down based on a lot of different factors.
So what we want to do is of course have a conversation with you to talk about your exact needs… talk about your documents… better understand the situation for your organization. And then we will come up with a custom pricing solution that will work for both parties. So this has to be kind of discussed on a one-to-one basis. There is no one-size or one-price-fits-all type of solution here. Everything is custom; keep that in mind.
So in summary that kind of wraps up our FAQ.
Equidox batch processing is a customized service for our clients to produce fully accessible documents in a fully automated way. Without any day-to-day human involvement. So this allows for on-demand accessible PDF creation. So that when any of your customers download a PDF, whether it be their bank statement, or a utility bill, or any sort of templatized document, Equidox batch processing will automatically produce that document in an accessible format. So our batch processing solution is a fully secure, fully scalable solution to meet your needs.
Whether you’re producing a few thousand or a few million documents, this system will take the burden of page-by-page remediation out of your hands. It will also ultimately ensure that there is inclusion for all of your clients, and it will mitigate your legal risk by complying with all of the accessibility requirements for your organization. So that is going to conclude our presentation today. As always please feel free to reach out to us through our website www.Equidox.com about Equidox in general.
You can definitely surf through our website and of course, we are very active on… [silence]
[Tammy] Dan appears to be having some connection problems. We just want to share that we are very active on social media. You can find us on LinkedIn or Facebook or Twitter. You can follow our business pages there and get any updates that you might need or be interested in. And we’re really always trying to keep new material going. And you’ll see announcements for our next webinar there. I really want to thank you for showing up today David.
[David] My pleasure.
[Tammy] That was a really cool explanation. Is there anything you’d like to say to the crowd before we close up today?
[Dan] I think I dropped my internet. I do apologize if I was in the middle of a sentence. But thank you for picking things up Tammy.
[Tammy] Yeah no problem. We were just giving David a moment to say his last hurrah here.
[David] Yeah I’d like to thank you all for listening. And I hope you have a good reassurance that we are doing our best to make sure everything is remediated as accurately as possible. And we are not slacking!
[Dan] Great thank you, everyone!
[David] Thank you!
END OF TRANSCRIPT.
Tammy Albee | Content Marketer | Onix Tammy joined Onix after four years experience working at the National Federation of the Blind. She firmly maintains that accessibility is about reaching everyone, regardless of ability, and boosting your market share in the process. "Nobody should be barred from accessing information. It's what drives our modern society."