Automatic Classification of Provisions

In the chapter “Automated Classification of Norms in Sources of Law” of the book “Semantic Processing of Legal Texts. Where the Language of Law Meets the Law of Language”, Emile de Maat and Radboud Winkels describe their research in this field, which aims to provide automated support for modelling sources of law for legal knowledge-based systems and services. The authors say that “many existing systems use models that do not reflect the entire law, and simplify parts of the text. These models are difficult to validate, maintain and re-use. We propose to create an intermediate model that has an isomorphic representation of the structure of the original text. A first step towards automated modelling is the detection and classification of provisions in sources of law. A list of different categories of norms and provisions that are used in Dutch legal texts is presented. These categories can be identified by the use of typical text patterns. Next, the results of experiments in automated classification of provisions using these patterns are presented. 91% of 592 sentences in fifteen different Dutch laws were classified correctly.”

The following is based on the work “Automatic Classification of Sentences in Dutch Laws” by Emile de Maat and Radboud Winkels:

Types of provisions or sentences in legislation
This is an overview of the types of provisions or sentences found in the legislation of several jurisdictions [following the work of E. de Maat and R. Winkels. Categorisation of Norms. In: A. Lodder and L. Mommers (Eds.), Legal Knowledge and Information Systems. Jurix 2007, pages 79-88. IOS Press, 2007]. The ‘Introduction’, ‘Conclusion’ and ‘Appendices’ are relatively unimportant for most uses of legislation.

Introduction
Body: which comprises the Definitions (Procedures for citizens, Core rules, Procedures for civil servants) and the Rule management
Conclusion
Appendices

The other types together form what we have called the ‘body’ of the law. For most uses, the so-called ‘core rules’ of the regulation are the important part.

The ‘core rules’ and the procedures make use of vocabulary that is partly specifically defined in the law, the ‘definitions’. Finally, there typically are auxiliary provisions that fit all the others into the legal system as a whole, e.g. an enactment clause.

We are interested in the body of the document, and would like to be able to classify the sentences that appear in the document according to their meaning. For several types of sentences, we can distinguish certain signal words, certain patterns, which tell us what kind of sentence we are dealing with.

However, a problem occurs when we encounter obligations and prohibitions. Though these are sometimes expressed with words like “should” or “must”, most of the time a ‘statement of fact’ is used. The Dutch official guidelines strongly advise legislative drafters to use this form, and to avoid words like “must”.

This means that the text does not state what must happen, but instead simply states that it happens. For example:
Funeral Act, article 46, sub 1: No bodies are interred on a closed cemetery.

Thus, there is no pattern to be found either. Because of this, we hope to identify this important group of statements “by default”: if we can identify patterns for everything else, we may assume that anything not classified by these patterns is one of these statements of fact.
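The “by default” idea can be sketched in a few lines of Python. This is only an illustration, not the authors' Java implementation: the pattern list uses English glosses of a few signal phrases mentioned in this text (the study itself worked with Dutch patterns), and the names `SIGNAL_PATTERNS` and `classify_with_default` are invented for the example.

```python
import re

# Hypothetical English glosses of signal phrases; the study used Dutch patterns.
SIGNAL_PATTERNS = [
    ("definition", re.compile(r"\bis understood\b", re.IGNORECASE)),
    ("deeming provision", re.compile(r"\bis deemed to\b", re.IGNORECASE)),
    ("right", re.compile(r"\bmay\b", re.IGNORECASE)),
]


def classify_with_default(sentence: str) -> str:
    """Return the first matching type, or 'statement of fact' when nothing matches."""
    for label, pattern in SIGNAL_PATTERNS:
        if pattern.search(sentence):
            return label
    # No signal phrase found: assume a norm phrased as a statement of fact.
    return "statement of fact"
```

With this sketch, the Funeral Act sentence quoted above matches none of the listed patterns and therefore falls through to the default category.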

As for the other sentence types, it was found that the general type of the sentence could usually be derived from its verb phrase. In earlier experiments, the authors used more elaborate patterns, consisting of the verb phrase with some other words appended, but these patterns always turned out to be too restrictive. For many applications, when dealing with verbs, a stemming algorithm would be employed to deal with different inflections for tense, person and number. However, in legislation, the tense does not vary much (the present simple is used most of the time) and the rules always use the third person (the exceptions are the introduction and the conclusion, which contain the directions from the King or Queen, at least in Holland, and employ the first person plural, the majestic plural). As such, the authors deemed it unnecessary to employ anything more complex than a simple pattern recogniser.
3. Patterns
Earlier research by one of the authors [E. de Maat. Natural Legal Modelling. Master’s thesis, University of Twente, Enschede, 2003] into the types of sentences that occur in laws formed the basis for the patterns, but the patterns needed to be extended and studied with 20 other laws.

3.1. Definitions and type extensions
In a definition, a description is given of the terms that are used in the legal source. The construct that is used for a definition in Dutch legal texts is: by x is understood y, which gives us a clear pattern to identify definitions by. Type extensions are added definitions, which expand or limit an earlier definition, using the same verb phrase, but with the inclusion of the word also or not. In the earlier research on the Income Tax Act 2001, the patterns used were: x is y, or: x are y. So far, however, our research indicates that the Income Tax Act was somewhat unique in its use of those patterns, and we have not included them here (meaning that this classifier would not work well on the Income Tax Act). Should it turn out that these patterns are more widespread, a more advanced classifier would be necessary to distinguish between definitions using this pattern and statement-of-fact sentences that merely use is or are as an auxiliary verb.
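A minimal sketch of how the definition pattern and its type-extension variants might be told apart with one regular expression. It assumes an English gloss of the construct (“is understood”, “is also understood”, “is not understood”) rather than the Dutch wording the authors matched, and the function name `classify_definition` is hypothetical.

```python
import re
from typing import Optional

# English gloss of the Dutch construct "by x is understood y"; an optional
# "also" or "not" inside the verb phrase marks a type extension.
UNDERSTOOD = re.compile(
    r"\bis\s+(?:(?P<modifier>also|not)\s+)?understood\b", re.IGNORECASE
)


def classify_definition(sentence: str) -> Optional[str]:
    """Return 'definition', 'type extension', or None when the pattern is absent."""
    match = UNDERSTOOD.search(sentence)
    if match is None:
        return None
    return "type extension" if match.group("modifier") else "definition"
```

Note that the sketch deliberately ignores the bare “x is y” form, for the reason given above: it cannot be separated from ordinary statements of fact by the verb phrase alone.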
3.2. Deeming provisions
Deeming statements are sentences in which a given situation is said to be considered as if it were another situation. Thus, if situation A is deemed equal to situation B, then all rules that apply to situation B apply to situation A as well. In this way, definitions can be extended to cover certain special, exceptional situations. Deeming provisions can be recognized by the pattern: is deemed to.
3.3. Norms
In preparation for this project to classify all sentences in a legal text, research was conducted to determine how to distinguish between norm sentences denoting a right and those denoting a duty [M. Franssen. Automated Detection of Norm Sentences in Laws. Twente Student Conference on IT, 2007]. This work yielded a large number of patterns that were incorporated in the classifier for this project. The main conclusion was that almost 80% of the rights could be identified by the verb: may or the phrase: is qualified. A host of smaller patterns accounted for the remaining rights. Sentences denoting duties usually did not follow a pattern; 80% of these sentences were statements of fact (as described above).
3.4. Application provisions
Application provisions are sentences that specify cases in which some other legislation (usually an article or subsection of an article) does (not) apply. In this way, additional conditions are added to existing norms. In case of an application provision that states that the other legislation does not apply, the application provision does, in fact, state an exception to that rule. An application provision that states that another piece of legislation does apply often seems to be in place to take away any doubts as to whether it ought to apply or not. The patterns used by these sentences are: does apply and: does not apply.
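As a rough sketch, the two phrases can be matched with a single expression that also records whether the provision states an exception. The English phrases stand in for the Dutch originals, and `classify_application` is an illustrative name, not part of the authors' classifier.

```python
import re
from typing import Optional, Tuple

# English glosses of the two phrases named above.
APPLIES = re.compile(r"\bdoes\s+(?P<negation>not\s+)?apply\b", re.IGNORECASE)


def classify_application(sentence: str) -> Optional[Tuple[str, bool]]:
    """Return ('application provision', is_exception), or None if no phrase matches."""
    match = APPLIES.search(sentence)
    if match is None:
        return None
    # A negated phrase means the referenced legislation is excluded: an exception.
    return ("application provision", match.group("negation") is not None)
```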
3.5. Penalization
Phrases may also specify some penalty that will be incurred if a norm is violated. In Dutch law, this is usually done through sentences like: will be punished with. In general, these phrases will be followed by a provision that denotes whether the punishable fact is a crime or a misdemeanour.
3.6. Value assignments and changes
Value assignments are used to give a value to a certain term in the text. These values can later be changed. These sentences express the formula used to calculate some value used in other sentences.
Income Tax Act 2001, article 3.3, sub 1: Taxable wages are wages reduced with the employee’s discount.
These sentences use a range of mathematical operations (to reduce, to increase) and comparisons (at most) which in combination with the verb to be or to amount to can be used to detect them.
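The combination of two cues can be sketched as follows. The word lists are English glosses limited to the operations and comparison named above; the real Dutch patterns are not reproduced here, and `is_value_assignment` is a hypothetical helper.

```python
import re

# Verb cue: "to be" or "to amount to" (English glosses, assumed for illustration).
VERB = re.compile(r"\b(?:is|are|amounts? to)\b", re.IGNORECASE)
# Operation or comparison cue, restricted to the examples mentioned in the text.
OPERATION = re.compile(r"\b(?:reduced|increased|at most)\b", re.IGNORECASE)


def is_value_assignment(sentence: str) -> bool:
    """True only when both a verb cue and an operation or comparison cue occur."""
    return bool(VERB.search(sentence)) and bool(OPERATION.search(sentence))
```

Requiring both cues is what keeps an ordinary statement of fact that merely contains “are” from being picked up as a value assignment.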
3.7. Lifecycle
These sentences deal with the maintenance of the legal texts, keeping them up to date. Most of them deal with modifying existing legislation, by adding new text, modifying text or deleting/repealing text. In addition to the sentences that express such changes, there are sentences that determine the enactment date of a legal source. Most laws include one such sentence, in which they determine their own enactment date. Another type of lifecycle sentence is the citation title designation, in which a (shortened) official title for the source is set. This usually appears at the end of the law text.

4. Experimental Set-up
We built a classifier (in Java) that takes well-structured legal sources as input and tries to classify their sentences according to their type, based on the typical patterns associated with these types. The types and their patterns were described in the previous section. In total, we used 81 patterns from about twenty Dutch laws, consisting of verb phrases, often with some keywords added. Most patterns consisted of one to three words.

For this experiment, an assumption was made with regard to sentences with an embedded list, such as:

Tobacco Act, article 1: In this law, and in the stipulations based on it, is understood by:
a. tobacco products: … ;
b. Our Minister: …;
c. appendix: …; …

Here, we assumed that classification can be based on the first part of the sentence, and that the list items are not needed for the classification. As input for our classifier, we used documents tagged in MetaLex, in which both sentences and lists were marked.
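The assumption can be made concrete with a small data structure in which the list items are stored but never inspected. This is a sketch under stated assumptions: `ListSentence` and `classify_list` are invented names, and the single English pattern stands in for the full Dutch pattern set.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ListSentence:
    """A sentence with an embedded list: a lead-in followed by list items."""
    lead_in: str
    items: List[str] = field(default_factory=list)


def classify_list(sentence: ListSentence) -> str:
    # Only the lead-in is examined; the items are ignored by design.
    if "is understood" in sentence.lead_in.lower():
        return "definition"
    return "statement of fact"


tobacco_act = ListSentence(
    lead_in="In this law, and in the stipulations based on it, is understood by:",
    items=["tobacco products: ...", "Our Minister: ...", "appendix: ..."],
)
```

Calling `classify_list(tobacco_act)` classifies the whole list as a definition from its lead-in alone.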

A simplified MetaLex example with a quoted element might look like this:

(sentence) The first member will become:
(quote)
(subpart)
(index)1.(/index)
(sentence) Rules concerning affairs of the Kingdom… (/sentence)
(/subpart)
(/quote)
(/sentence)

To check whether clauses were classified correctly, all sentences and lists in all laws were also classified manually.
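Comparing the classifier's output with the manual classification amounts to computing an accuracy and counting false positives per class. The sketch below shows one way to do that; the function name `evaluate` is hypothetical and the labels are toy values invented for illustration, not the study's data.

```python
from collections import Counter
from typing import List, Tuple


def evaluate(predicted: List[str], manual: List[str]) -> Tuple[float, Counter]:
    """Return overall accuracy and the number of false positives per predicted class."""
    if len(predicted) != len(manual):
        raise ValueError("label lists must have the same length")
    if not manual:
        raise ValueError("nothing to evaluate")
    correct = sum(p == m for p, m in zip(predicted, manual))
    false_positives = Counter(p for p, m in zip(predicted, manual) if p != m)
    return correct / len(manual), false_positives


# Toy labels invented for illustration; these are not the study's data.
predicted = ["definition", "right", "statement of fact", "penalization"]
manual = ["definition", "right", "statement of fact", "right"]
accuracy, false_positives = evaluate(predicted, manual)
```

In this toy run three of the four labels agree, and the one disagreement is recorded as a false positive for the class the classifier wrongly predicted.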

5. Results
In the case of a law that changes an already existing law, the elements that are changed, repealed or inserted are marked as so-called ‘quoted’ elements within MetaLex. The classifier also classifies the sentences and lists within these quoted elements. The simplified MetaLex example above shows such a quoted element. The classifier will try to classify both the sentence stating the change and the sentence that will become part of the altered law (“Rules concerning…”).
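To illustrate how both the outer and the quoted sentence can be reached, the sketch below rewrites the simplified example as well-formed XML and walks it with Python's standard `xml.etree.ElementTree`. The tag names simply mirror the simplified example; real MetaLex element names and namespaces differ, so treat them as placeholders.

```python
import xml.etree.ElementTree as ET

# Simplified markup modelled on the example above; real MetaLex element names
# and namespaces differ, so these tags are placeholders.
SAMPLE = """
<sentence>The first member will become:
  <quote>
    <subpart>
      <index>1.</index>
      <sentence>Rules concerning affairs of the Kingdom…</sentence>
    </subpart>
  </quote>
</sentence>
"""


def sentence_texts(xml_source: str) -> list:
    """Collect the own text of every sentence element, outer and quoted alike."""
    root = ET.fromstring(xml_source.strip())
    # iter() visits the root itself and all descendants in document order.
    return [(node.text or "").strip() for node in root.iter("sentence")]
```

Each text returned by `sentence_texts(SAMPLE)` could then be handed to the pattern classifier separately: first the sentence stating the change, then the quoted sentence.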

The first thing to notice is that the classifier performs very well: 94% of all sentences and 89% of all lists are classified correctly.

Norms obviously make up the greatest part of all sentences in the laws we used: the explicitly recognized ones plus the default make up 58% of all phrases. The second largest category is the so-called ‘change’ class. These are the provisions that change some existing law; 33% of all sentences and lists belong to this class. The classifier specifies this type even further.

One sentence contained two changes: a renumbering and a repeal. It is listed in Table 2 as ‘Mixed type’, and is not listed in Table 3. It used a specialised pattern that had not been added, rather than two patterns combined. (It may be better to classify this sentence as a separate type than to consider it a mixed type sentence.) Five of the six false positives of the “repealed” type sentences were provisions concerning the repeal of fines instead of articles. This will require more sophisticated patterns or dedicated ‘anti-patterns’ (i.e. not applicable when the sentence contains the word ‘fine’).
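One possible shape for such an anti-pattern is a rule that pairs a pattern with an optional veto expression. This is a sketch of the idea only: the `Rule` class and the English phrase “is repealed” are assumptions made for illustration, not the patterns the authors used.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Rule:
    """A classification pattern with an optional anti-pattern that vetoes a match."""
    label: str
    pattern: re.Pattern
    anti_pattern: Optional[re.Pattern] = None

    def matches(self, sentence: str) -> bool:
        if not self.pattern.search(sentence):
            return False
        if self.anti_pattern is not None and self.anti_pattern.search(sentence):
            return False  # vetoed, e.g. the repeal concerns a fine, not an article
        return True


# Hypothetical English gloss of a repeal pattern, with 'fine' as its anti-pattern.
REPEAL = Rule(
    label="repeal",
    pattern=re.compile(r"\bis repealed\b", re.IGNORECASE),
    anti_pattern=re.compile(r"\bfine\b", re.IGNORECASE),
)
```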

The two false penalizations are in fact both rights; the pattern that triggered this classification was part of a qualification of a legal body that was given certain rights. A more sophisticated parser will be needed to detect this use of the pattern. The false rights and false application statements (twelve in total) were all misclassified because the pattern for a right or application statement appeared in an auxiliary sentence. The authors encountered only one value assignment in the texts classified during this experiment. These seem to be specific to certain domains (e.g. taxes), and perhaps they are usually deferred to lower-order regulations.

The authors also found only one citation title and no deeming provisions at all. The authors’ assumption that all lists could be classified by their first sentence does not always hold.

Conclusions and Discussion
The classifier works very well, at least on the set of Dutch laws used in this experiment; 94% of all phrases were classified correctly and there were hardly any false positives. Almost 60% of all phrases were classified as some type of norm, and a further 33% as clauses changing an existing law. A necessary pre-condition for the classifier is that the structure of the documents has already been marked. For most modern corpora this will not pose a problem, as they usually have been marked in such a way. For legacy texts, however, an automatic structure recogniser would be desirable. If the text is marked by hand, it is relatively easy to also classify it, and the gain from using the classifier will be rather small.

Despite these very positive results, there is of course room for improvement. An important threat to the accuracy of the classifier seems to be the occurrence of patterns in auxiliary sentences. Franssen [see above] has suggested that this problem may be solved by a smart ordering of the patterns. By giving those patterns that may appear in the auxiliary sentences a lower priority, the chance is increased that a sentence is classified based on its principal sentence. However, this will still leave room for error.
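The ordering idea can be sketched as a priority attached to each pattern, with patterns that tend to occur in auxiliary sentences tried last. The priorities, the English glosses and the name `classify_by_priority` are all invented for illustration; which patterns deserve a lower priority would have to come from the data.

```python
import re
from typing import List, NamedTuple


class PrioritisedPattern(NamedTuple):
    priority: int  # lower number = tried earlier
    label: str
    regex: str


# Priorities are invented for illustration: "may" is assumed here to occur
# often in auxiliary sentences, so it is tried after the application pattern.
PATTERNS = [
    PrioritisedPattern(2, "right", r"\bmay\b"),
    PrioritisedPattern(1, "application provision", r"\bdoes (?:not )?apply\b"),
]


def classify_by_priority(sentence: str, patterns: List[PrioritisedPattern]) -> str:
    """Try patterns in priority order and return the label of the first match."""
    for candidate in sorted(patterns, key=lambda p: p.priority):
        if re.search(candidate.regex, sentence, re.IGNORECASE):
            return candidate.label
    return "statement of fact"
```

For a sentence such as “Article 5 does not apply to persons who may vote.”, the higher-priority application pattern wins even though “may” also occurs. As the text notes, this reduces but does not remove the risk of misclassification.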

It would be preferable if the sentences in the input were somehow split into principal and auxiliary sentences. This classification is merely the first step in a larger process to create a model for these sentences, and the distinction between principal and auxiliary sentences will be of use in a later stage as well.

Another point of improvement is the granularity of the current classifier. With regard to the normative sentences, the classification is very coarse, with only two categories: Right/Permission and Obligation/Duty (the “Statement of Fact” category displayed in the results also denotes obligations/duties). The authors of the research suspect it will be possible to make more distinctions.

Especially with regard to the norms of competence, it seems that several standardised constructs are used within the Dutch laws. Will these results generalize to all Dutch law and possibly other jurisdictions? A pattern-matching approach often lacks generalization capabilities. Although languages do have underlying rules, people will often stretch and bend these to their needs. This means that a system based on rules is often too rigid to deal with all the variation that can occur [C.D. Manning and H. Schütze. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge (MA), 1999].

Therefore, a statistical approach is often advocated [C.D. Manning and H. Schütze. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge (MA), 1999 and M.-F. Moens. Innovative techniques for legal text retrieval. Artificial Intelligence and Law, 9(1), pages 29-57, 2001]. However, the amount of variation in legal texts is restricted, as legal drafters will seldom use a completely new style, instead using the style of older laws or the official guidelines. Our patterns were gathered from a completely different set of laws than the one we tested them on. The success rate of 94% strengthens us in the idea that the approach should generalize to all relatively new Dutch legislation. We will need to add patterns for older legislation and possibly for certain specific types like Tax Law.

Likewise, there is no reason to assume that similar success could not be achieved for other jurisdictions with different patterns. There too, the language used in legislation will probably be restricted and contain typical patterns. C. Biagioli, E. Francesconi, A. Passerini, S. Montemagni and C. Soria (Automatic semantics extraction in law documents. In Proceedings of the Tenth International Conference on Artificial Intelligence and Law (ICAIL ’05), pages 133-140, ACM Press, New York, 2005) describe an experiment on Italian law in which machine learning techniques (SVM) are used to classify paragraphs of law texts. They achieved an average of 90% correctness in classifying 582 paragraphs (provisions) into 11 types or classes. Their set of laws contained more ‘change’ type sentences (50% as opposed to 33% in the Dutch study) and only 15% norms, but they do not mention the ‘statement of fact’ phrasing for normative expressions. They also have a large number (20%) of what they call ‘penalties’ (‘penalization’ in the Dutch authors’ terms), which leads the Dutch authors to suspect that penal law was used as a domain. Based on this other study it is tempting to conclude that a pattern-based context-free grammar works as well in classifying sentences in legislation as a machine learning approach. This would be due to the simplicity and consistent use of the patterns the authors of this research found.

In the experience of the authors, in the time needed to construct a training set for machine learning, probably all relevant patterns are already found. Of course, a definite conclusion cannot be drawn due to the difference between the domains of the Dutch study and the Italian study. A future extension in the classification approach may be to use statistical data to choose between competing classifications.
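One very simple reading of that future extension is to break ties between competing classifications by how often each class has been observed before. The sketch below assumes that reading; the counts are invented for illustration and `choose_by_frequency` is a hypothetical helper, not something described by the authors.

```python
from collections import Counter
from typing import Iterable


def choose_by_frequency(candidates: Iterable[str], observed: Counter) -> str:
    """Pick the candidate label seen most often in previously classified sentences."""
    candidates = list(candidates)
    if not candidates:
        raise ValueError("no candidate labels")
    # max() keeps the first candidate on ties, so the original pattern order decides.
    return max(candidates, key=lambda label: observed[label])


# Invented counts for illustration; these are not figures from either study.
observed = Counter({"right": 40, "application provision": 25, "penalization": 5})
```

With these counts, a sentence matching both a penalization pattern and a right pattern would be assigned the more frequent class, ‘right’.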

T. Gonçalves and P. Quaresma (Is linguistic information relevant for the classification of legal texts? In Proceedings of the Tenth International Conference on Artificial Intelligence and Law (ICAIL ’05), pages 168-176, ACM Press, New York, 2005) suggest moving the other way, adding more linguistic data to improve a machine learning approach to classification (again using SVM). However, the authors will first see how far they can get with smart ordering and with parsing techniques that distinguish principal and auxiliary sentences, as described above.

