What is the classification problem in Alfresco?
This is one of the most common problems in Alfresco, known to all of us who work with it. The problem is basically this:
- How to set the metadata of a node based on the contents of the document
An example of this problem is adding a document that represents an invoice to the repository. Invoices can have more than a couple of properties; let's try to list the basic ones:
- Currency (EUR, Dollar, CHF …)
- It will most definitely have a recipient (an invoice must be addressed to some company, right?)
- Invoice value ($656) …
- It can have many more, like the number of the contract that the invoice is based on, and so on.
The common solution to this problem is that a person opens the document, finds the information and uses the Share form to fill in the values. This is the “solution” to the problem.
Thinking out of the box
I am sure you hear about “thinking out of the box” every once in a while, so let's try to do that.
Here is the new invoice scenario:
- The invoice is scanned and/or uploaded to a folder called Incoming documents
- This upload starts a workflow called Incoming Invoice
- The Incoming Invoice workflow can have several steps, but one of them must be filling in the invoice metadata
- After the metadata is filled in, the invoice can go into the pipeline to be paid…
If we focus on the third step (filling in the metadata), the person needs to see the content of the document and fill in the workflow form at the same time. This is of course not possible, as Alfresco does not show the document and the (workflow) form in the same window. The user needs to download the document (or open the file in a new tab), read it and then fill in the properties. The thinking-out-of-the-box way is to ask developers to fix this by creating a custom page where the user can see a preview of the workflow package documents next to the workflow form. Since I have already done this, I can say that it is an awesome way to speed things up. Thinking out of the box par excellence. Or is it?
If you think out of the box that means that you are in the box, right? One should never be in the box in the first place.
Being out of the box
Because we are not in the box now, let's examine an example invoice.
Invoice
Issued to: Company Co Yu, ID 1122334455
Account Number: 9988551-RS
Date: 22/9/2015
Total: EUR 669.22
Contract No: 2015/18-01
In this example we can see that this is an invoice, issued to Company Co Yu, with a total of EUR 669.22 and based on the contract 2015/18-01. If a user saw this, it would take him some time to fill in the form, finish the task and open another one. If he did this all day, I guess it would take him a year to go crazy.
The solution that I am proposing to this problem is to use the big guns called machine learning algorithms. These algorithms are going to check the content of the document, scan for keywords and respond with the correct values.
When this is integrated into Alfresco, the user would just need to check that everything is OK.
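To give a feeling for what “scan for keywords” could mean, here is a minimal Java sketch that turns the invoice text above into keyword counts, which is the kind of input a text classifier typically works with. The class InvoiceKeywords and its countTokens method are my own hypothetical names for illustration; they are not part of Alfresco or Weka, and a real integration may tokenize differently.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper for illustration; not part of Alfresco or Weka.
class InvoiceKeywords {

    // Lowercases the text, splits it on anything that is not a letter or a digit,
    // and counts how often each remaining token occurs.
    static Map<String, Integer> countTokens(String text) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (token.isEmpty()) {
                continue; // split() can produce a leading empty string
            }
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        String invoice = "Invoice Issued to: Company Co Yu, ID 1122334455 "
                + "Account Number: 9988551-RS Date: 22/9/2015 "
                + "Total: EUR 669.22 Contract No: 2015/18-01";
        // Prints each token with its count, in order of first appearance.
        System.out.println(countTokens(invoice));
    }
}
```

Tokens such as "eur" or the contract number parts are exactly the keywords a classifier can learn to associate with a currency or a contract.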
Machine learning algorithms crash course
What are machine learning algorithms?
An MLA is an algorithm that can study existing data and, when new data arrives, predict the missing information. With good training data the predictions can be very accurate, although how accurate depends on the algorithm and on the data.
There are supervised and unsupervised algorithms. The difference is that supervised algorithms need, during learning, data that already comes with the results. In our case we would need documents that already have metadata. Hey, we already have this!
We can also separate machine learning algorithms by the problems that they solve:
- Classification is a problem where we need to specify the class of an entity based on the information that we have. Let's say that the day temperature is 30 degrees: is it hot or cold? Hot and cold are referred to as classes.
- Regression is a problem where we need to predict a numeric value by understanding the relationship between the properties.
- Clustering is a problem where we need to group entities together based on their properties.
- Association rules: we see this every day when we listen to music on YouTube. It always finds something that we would like and that relates to the current song.
In our case we have a classification problem. The document needs to be given a class; the class is the contract, currency or supplier…
There are more than enough algorithms that we can use; I will try to list a few of them.
- Support vector machines
- Naive Bayes
Java implementations of all of these and other algorithms can be found in Weka.
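To make the idea less abstract, here is a toy Naive Bayes classifier in plain Java that learns the currency class from a few labelled invoice texts. It is only a sketch under simple assumptions (word counts as features, add-one smoothing). TinyNaiveBayes is a hypothetical class written for this illustration; it is neither the Weka implementation nor the code of my Alfresco module.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Toy multinomial Naive Bayes with add-one smoothing (illustration only).
class TinyNaiveBayes {
    private final Map<String, Integer> docsPerClass = new HashMap<>();
    private final Map<String, Map<String, Integer>> tokenCounts = new HashMap<>();
    private final Map<String, Integer> tokensPerClass = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();
    private int totalDocs = 0;

    private static String[] tokenize(String text) {
        return text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+");
    }

    // Supervised learning step: a document text together with its known class.
    void train(String text, String label) {
        totalDocs++;
        docsPerClass.merge(label, 1, Integer::sum);
        Map<String, Integer> counts = tokenCounts.computeIfAbsent(label, k -> new HashMap<>());
        for (String token : tokenize(text)) {
            if (token.isEmpty()) continue;
            counts.merge(token, 1, Integer::sum);
            tokensPerClass.merge(label, 1, Integer::sum);
            vocabulary.add(token);
        }
    }

    // Returns the class with the highest log posterior, or null if nothing was trained.
    String classify(String text) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String label : docsPerClass.keySet()) {
            // Class prior: share of training documents that carry this label.
            double score = Math.log((double) docsPerClass.get(label) / totalDocs);
            Map<String, Integer> counts = tokenCounts.get(label);
            int classTokens = tokensPerClass.getOrDefault(label, 0);
            for (String token : tokenize(text)) {
                if (token.isEmpty()) continue;
                int count = counts.getOrDefault(token, 0);
                // Add-one smoothing; the extra +1 reserves a slot for unseen tokens.
                score += Math.log((count + 1.0) / (classTokens + vocabulary.size() + 1));
            }
            if (score > bestScore) {
                bestScore = score;
                best = label;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        TinyNaiveBayes nb = new TinyNaiveBayes();
        nb.train("Total: EUR 120.00 Contract No: 2015/01-01", "EUR");
        nb.train("Total: EUR 89.50 Contract No: 2015/02-07", "EUR");
        nb.train("Total: USD 300.00 Contract No: 2015/03-11", "USD");
        nb.train("Total: CHF 45.10 Contract No: 2015/04-02", "CHF");
        System.out.println(nb.classify("Date: 22/9/2015 Total: EUR 669.22 Contract No: 2015/18-01"));
    }
}
```

In practice you would rely on a tested library rather than hand-rolled code like this; the sketch only shows why documents that already carry metadata are enough to train a classifier.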
What is Weka?
Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code. Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well-suited for developing new machine learning schemes.
Found only on the islands of New Zealand, the Weka is a flightless bird with an inquisitive nature.
If you want to read more about Weka go here.
I have already integrated Weka into Alfresco and would like to show you how to ask the Alfresco module to learn the classification and how to get prediction results.
Learning the classification and creating the model is done with one line. Let's assume that we have a bunch of invoices with the aspect ab:invoice, and all of the invoices have a currency property named ab:invoiceCurrency.
This line of code will learn from the data and save the model in Alfresco for future use. We will not go into details now; this will be part of future posts.
If we recall the invoice workflow, just before the user needs to fill in the form we should add a listener for the create event on the user task.
<userTask ...>
  <extensionElements>
    <activiti:taskListener event="create" class="...">
      <activiti:field name="script">
        <activiti:string>
          <![CDATA[
            var currency = scriptWeka.classify("invoiceCurrency",result);
            //TODO set currency value into the task property
            task.setVariable('ab_invoiceCurrency', currency);
          ]]>
        </activiti:string>
      </activiti:field>
    </activiti:taskListener>
  </extensionElements>
</userTask>
When the user opens the task, the currency will already be set for him/her. #awesome.
I have tested the solution on 2000 invoices and prediction is almost instantaneous. In future posts I will go into details, so please subscribe.