Automated High Volume Solutions
An Equidox data scientist explains how high-volume automated solutions work.
[Dan Tuleta] Okay, so it's just about two o'clock, so I think we're ready to get started. For everyone that's still shuffling in, all this is being recorded, so you won't miss too much in the very first minute or two. Thank you all for attending, and welcome again to those of you who have attended our webinars before. This is the next installment of Equidox Webinar Wednesdays. I heard a bit of background noise there; I'm not sure if someone was off of mute. But okay, welcome everyone to Equidox Webinar Wednesdays, talking about our newest service, batch processing. As always, we do appreciate your attention. If you have any questions throughout this presentation that we do not cover, please do not hesitate to reach out to us through our website at equidox.co. We're also very active on LinkedIn and social media, so if you are on LinkedIn, please feel free to connect with us. We post a lot of articles and information about our product, services, and the accessibility space in general. Now, this week is a little bit different. If you've joined us before for our Equidox Webinar Wednesdays, you're probably used to me talking and demoing something about the Equidox software, but this time I'm actually co-presenting with my friend David Freelan. He is on the Equidox team and he's our lead data scientist, so I'm going to turn this over to David to introduce himself and talk a bit more about batch processing.
[David Freelan] Thank you, Dan. I'm going to tell you a bit about myself first. I started my journey at George Mason University, where I got my graduate degree in AI and machine learning. I then got recruited by a robotic soccer team, and we represented the United States in a competition called RoboCup. I worked my way up to team lead, and we competed in both Brazil and China and went on to publish some papers in deep learning and multi-agent systems. After my degree, I actually had a connection at the Cleveland Clinic that led me to Equidox, and I thought working
for a company doing accessibility and trying to get everybody included sounded like a great way to apply my skills, so I found myself very motivated to contribute to Equidox's goals. So, wrapping things around: robotics has a lot of different components to it, but what's most relevant for us today is computer vision techniques, which try to figure out, as an example for the soccer stuff, where a ball or a robotic player is. All that visual analysis of what's going on on a soccer field can actually be applied to analyzing a document in front of you. My initial contribution to machine learning at Equidox was our list and table editor. When making documents accessible, in a lot of situations our remediators have to specify where every single row and every single column in a table is, as well as highlight each item in a list. I can use the bullet points on this page as an example: a remediator would have to go through, tag this whole thing as a list, and tag each and every single item one by one, so all six items here, and they'd have to make sure they tag each and every single delimiter. So just for this tiny little list, we have a lot of different things we need to make sure are tagged. We really want to automate as much of this as possible so that we save a lot of business time. As for the way we trained our machine learning models: we can take that manual human process that's been done many, many times by our remediators, teach a machine learning model based on those hundreds of thousands of examples that we can create, and then use that to consistently detect these tables and lists and skip a lot of that manual labor. While we use our internal documents to train these machine learning processes, when we apply ourselves to batch processing for a company, we're going to use your documents to train. We don't want a general one-size-fits-all; we're going to take your documents and fit ourselves to identify the
elements in there as accurately as possible. If you have a few documents out of thousands or even millions that look a bit different, we want to make sure that we use every machine learning tool at our disposal to find those little outlier documents, which we'll take a closer look at later. We want to give you a sneak peek at how that's done so effectively, but first let's take another look at our goal state here. Oh, that's the next slide. Yeah, here's our goal state. We want to go ahead and identify each of these single elements. We have an icon on the top left here for your bank, and we have a table with a header, and the table has headers inside it. We've got all sorts of things going on. Now, this is just an internal representation of what's going on on our end; don't worry, we're not going to be modifying your document in any way. Before we train our models, we need to do a couple of internal steps first. We're going to have to gather these documents and create all these templates, and I want to show you how we do that. But first, let's get just a few definitions out of the way so we can tackle the problem together. Artificial intelligence is just a general, broad scope of algorithms that are used to mimic human-like intelligence on any particular task. More specifically, we're interested in computer vision, where we're trying to mimic, say, the visual cortex of a human and identify whether something is a cat or a dog, or whether something's a table or a list. Machine learning is a different subset of artificial intelligence, which is about teaching a model: instead of hard-coding, you teach a model over time. Where these intersect is where you're teaching a computer to see. So that's kind of our baseline vocabulary here, and now that we have that out of the way, let's go ahead and try to be data scientists together for a second and take a look at an
example company. We can take a machine learning algorithm and teach it how to group similar documents together. In this example on the right, each dot represents a single page of a document, and the closer those dots are, the more similar those pages are. To show you a better example of what I mean, let's zoom in on a cluster here in the middle and take a look. In this cluster we have a lot of pie charts. If you start from the top left, the pie chart appears near the bottom of the page; as you go toward the bottom right of that graph, the pie chart kind of sneaks up toward the top. So just by zooming in on this cluster and clicking on three different documents, our machine learning algorithm and our developers already get a really good idea of the variation in these documents. We have the opportunity now to take a look at some of those points that are way on the outside, really far away from all the other points. We call these the outlier points, and we see an interesting one here where the pie chart only has one category. This is a situation that a machine learning algorithm, or honestly a human who is handwriting a template, could totally miss. If there are thousands or millions of documents, this is potentially one in a million and could be hard to find, but our machine learning algorithm was able to find that outlier and handle it. All right, there's one other thing I want to note on this page. There's a "100 percent" right on top of the pie chart that can kind of blend in with the paragraph above it. Again, without noticing these sorts of things and going through this initial machine learning process, there's a good chance we would miss that "100 percent", and missing something means that someone isn't included. So we have to make sure we get
everything. Let's take a look at this real fast. This cluster that we zoomed into has a lot of letters in it. I wanted to show here that, while we are obviously doing a lot more than a fixed overlay, in this cluster there are some static elements, say the logo on the top and the address, that are all in generally the same location. We can take advantage of that and make sure that our algorithm focuses on capturing the variations. As we look at these three individual points, we can see that the paragraphs can vary in length and some of the headers can vary in size, so we need to make sure that we account for that. We sort of have a mix of the static and the dynamic going on in this cluster. Okay, so these three clusters appear closer together because they have a similar style, but each cluster has a different number of columns. What I wanted to show you here is how these documents, even though they have a differing number of columns, have a lot of similar features. So instead of building three different templates, one for each cluster, I could split up the documents by their columns and create one template that can handle columns, and combine all these clusters together. Graphing all these out gave me a good view of how these documents differ and how they are similar to each other, so the machine learning algorithm really helped us learn a bit more about these documents. The last kind of document I wanted to show you today is one where the tables vary in their structure. At the top left here we have very few cells, whereas if we go all the way down the graph to the lower right, we can see a lot of very dense cells. Our machine learning algorithm here is informing us that this is primarily how the document is going to change: the number of cells in this table. So we want to make sure that our machine
learning algorithm not only will identify these PDF elements but is able to tag them correctly despite these variations. We have our own table algorithms to make sure that we handle this appropriately. And again, let's take a look at another outlier point, because this example could be really important. This could be a bug on this company's end, or it could be an outlier that we need to make sure we can handle, but the one in the top left is completely missing a table; there are no cells at all. This is another thing where, if this cluster had a million documents in it, without something like this algorithm you might never find it looking by hand. So this is a case where we can either alert you that this customer might have an issue or, if you say it's not a problem, just keep that in mind ourselves and make sure that we are handling that edge case appropriately. Once we've done all of that, we can build ourselves a lovely template. This is the same thing we saw on the last slide, sort of a reminder of what our goal state is. We've now been able to show you how we can identify all these different elements on the page, and this is how, internally (again, we're not marking your documents), all these things will be tagged. We'll make sure we apply the tags correctly so that they're tagged, compliant, accessible, and usable, everything you'd want here. So now you might be wondering: cool, we've made all these templates, we've done all this fancy machine learning stuff, how are we going to get these documents to your company? You have documents, you want them remediated by our batch processing system, so how is that going to work for you? What we want to supply for your developers is an interface where you can call a REST API, and that REST API on our end is going to
interact with either a local server or a cloud server that can be scaled up and down; whether it's local or cloud is based on your preference. Once we have that PDF, we can forward it to the machine learning algorithm that we have just described in great detail, and after the machine learning has identified the templates and tagged them all correctly, we can simply send that right back to you through the same REST API, and you have yourself a tagged and accessible document for your end users. Next slide. Okay, so that concludes this. Dan, if you can take it away, we have a few FAQs.
[Dan Tuleta] Sure, thank you, David. When we are talking about batch processing with our prospects and existing clients, there are a lot of common questions that come up, so what we've tried to do within this presentation is compile a list of these FAQs. We're going to walk through a few of those common questions that we feel many of you on this call are probably wondering about right now. So Dave, if you want to jump to the next slide, we can get started. Okay, so David, is this really accessible?
[David Freelan] Yes, yes it is. It's not just my expertise that's going into this; I'm not a one-man show. We have our director of accessibility and his entire team, who are going to be going through it with our developers and all the machine learning process here to make sure that we're complying with everything. We don't want to leave anybody out.
[Dan Tuleta] Right. Our PDF remediation team is full of PDF experts. When we send documents back to clients as part of our PDF remediation services, everything goes through a multi-step validation process, so every document and every template that we're designing for you is going to go through rigorous validation. And this is not just trying to trick the accessibility checkers into saying that a document is accessible; we are actually building a fully tagged, fully accessible, fully usable
document for your clients. Now again, this is not an overlay. These are not static documents that have to be exactly the same, where you just copy and paste the exact same template onto every document. These documents might have minor variances, as David has explained. Even if that's the case, our machine learning algorithms are able to detect that, and every document will be tagged in a unique way to make sure that it is fully usable. So David, a lot of people are wondering about these templates: how many templates can you have, what if they need to change, and if they do need to change, how long do these changes take to apply?
[David Freelan] Yeah, so the number of templates is very fluid. I wouldn't necessarily be too concerned about the number, because templates can vary in difficulty, so that's something we handle on a case-by-case basis. If they change, again, it depends on how complex the changes are. I imagine a lot of changes can vary in difficulty, so that's again, unfortunately, one of those things we tackle on a case-by-case basis. And the length of time is, again, one of those potentially unsatisfying answers, but we want to make sure we're working with you, and the length of time it takes us to complete the project is going to depend on your needs.
[Dan Tuleta] Great. One thing to keep in mind is that all of these are custom solutions, so that's going to be sort of a theme throughout the FAQs. We really do need to engage with you and your team on a one-to-one basis and understand your documents, your templates, and all of your needs to come up with a custom solution that will fit all of your documents. So David, can the system run multiple templates simultaneously?
[David Freelan] Yes, that's really not a problem. Running an arbitrary number of templates is not an issue for us. We just need enough processing power to do it, and we don't think it's going to take that much processing power, so it's really
not a problem for us.
[Dan Tuleta] Great. And so David, of course a very important question: is this process fully automated, or does it need any sort of human hand-holding or babysitting on a day-to-day basis?
[David Freelan] It is absolutely automated. It's just a matter of having those initial conversations and creating those templates, and from there it should just go smoothly.
[Dan Tuleta] Great. Yeah, once we've had those initial discovery calls, we better understand your templates and your documents, and we can develop those templates, we can take the human involvement completely out of the equation, and it all just works like magic. Another question that we're often asked is: should this be an on-prem or a cloud-based type of solution? In talking with our developers, both options are available, so we can set up this system either in the cloud or locally installed on your servers. Our developers have recommended on-prem for this type of system because it is going to be faster: you won't have data that has to ping back and forth from the cloud, so everything can stay local in your own environment. This might also help you meet some of your internal security requirements. A lot of these documents might contain client data, information about specific people, or account numbers, for example, so there might be a lot of very strict and rigorous internal IT security measures that you have to adhere to. Keeping it locally installed on your server in your environment might be a better option, but if you are interested in having a conversation about how to install this in the cloud, that's absolutely fine. We can support either option. Now David, what kind of volume and scalability are supported here?
[David Freelan] It's arbitrarily large. You can have a billion documents going through and we can handle it; you just need the hardware to do it. We don't exactly expect you to have to rent a supercomputer or anything to do that sort of thing, but it's again one of those case-by-case things, so we'll work with you to determine exactly how big of a machine you might need.
[Dan Tuleta] Right. There can be a lot of variance depending on the size of the client, whether you're a small credit union, a small utility company, or a Fortune 500 company. You could be talking about thousands of documents a month or maybe millions of documents per month, but we can scale accordingly. It really is just dependent on the hardware that you are trying to run this through, but that again is a conversation that we can have offline, once we're having a one-to-one discussion with you and your team. Now, another question that we're often asked is: is it secure? And yes, it is secure. Whether you go with the cloud option or on-prem, it is in fact a very secure system. Many people might opt for the on-prem install just for that extra layer of security, keeping everything installed locally in your own environment to adhere to all of your internal IT security rules and regulations. But just keep in mind, when you take everything away from this presentation, that this is in fact a secure solution, and we can have a more detailed discussion about any of your specific requirements when the time comes. Thank you. Oh, I didn't even notice you changed the slide, David, thank you. So, what does it cost? This is, of course, the million-dollar question.
Everyone wants to know what this will cost, and it really can vary depending on the complexity of the templates, the number of templates that you have, the volume, and the hardware that you are running it on. There are a number of factors that go into the price, and we know that this has to scale up and down based on a lot of different factors. So what we want to do is, of course, have a conversation with you to talk about your exact needs, talk about your documents, better understand the situation for your organization, and then come up with a custom pricing solution that will work for both parties. This has to be discussed on a one-to-one basis. There is no one-size or one-price-fits-all type of solution here; everything is custom, so keep that in mind. That kind of wraps up our FAQ. So in summary, Equidox batch processing is a customized service for our clients to produce fully accessible documents in a fully automated way, without any day-to-day human involvement. This allows for on-demand accessible PDF creation, so that when any of your customers download a PDF, whether it be their bank statement or a utility bill or any sort of templatized document, Equidox batch processing will automatically produce that document in an accessible format. Our batch processing solution is a fully secure, fully scalable solution to meet your needs. Whether you're producing a few thousand or a few million documents, this system will take the burden of page-by-page remediation out of your hands. It will also ultimately ensure that there is inclusion for all of your clients, and it will mitigate your legal risk by helping you comply with all of the accessibility requirements for your organization.
For more information about how Equidox Software Company can help you with PDF accessibility, email us at EquidoxSales@equidox.co, give us a call at 216-529-3030, or visit our website at www.equidox.co.
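As an illustration of the outlier search described in the talk (each page plotted as a dot, with outliers sitting far away from every other dot), here is a minimal Python sketch. The talk does not name the algorithm Equidox uses, so this substitutes a simple nearest-neighbour distance score as a stand-in. The feature values, the function names `knn_score` and `find_outliers`, and the `factor` threshold are all hypothetical choices made for the example.

```python
import math
from statistics import median


def knn_score(points, k=2):
    """Mean Euclidean distance from each point to its k nearest neighbours."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(sum(dists[:k]) / k)
    return scores


def find_outliers(points, k=2, factor=3.0):
    """Indices of points whose score exceeds `factor` times the median score."""
    scores = knn_score(points, k)
    cutoff = factor * median(scores)
    return [i for i, s in enumerate(scores) if s > cutoff]


# Hypothetical page features (assumed for illustration only):
# (vertical position of a chart on the page, scaled table cell count)
pages = [
    (0.80, 0.10), (0.78, 0.12), (0.75, 0.11), (0.70, 0.13),  # one group of similar pages
    (0.20, 0.60), (0.22, 0.62), (0.18, 0.58), (0.21, 0.61),  # a second group
    (0.95, 0.95),                                            # a page unlike any other
]

outliers = find_outliers(pages)
print(outliers)
```

With these made-up features, only the last page is flagged, since its nearest neighbours are far away compared with the typical spacing inside each group. A real system would presumably derive its features from the page content itself and would need a threshold tuned to the document set.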
Dan Tuleta hosts a special guest, David Freelan, an Equidox data scientist, to talk about high-volume solutions. David will discuss his artificial intelligence developments and how these can make bulk PDF remediation completely automated and accessible.