Sanchay News

संचय समाचार

Archive for the ‘Language’ Category

23rd February, 2011


A new avatar (no connection to the movie) of Sanchay is going to appear soon.

The first Sanchay Online Application is on the way. Like most of the major parts of Sanchay, it was done in three-four days.

So that’s why there are so many bugs?

Watch this space (as they say)…

(Also note that the latest builds will now be available here, instead of here.)

10th April, 2010


Some more information about the task mode of operation for the Annotation Interfaces (not yet implemented for the Parallel Corpus Markup):

  • The annotation process starts with one copy of the document to be annotated, say, story-1.
  • When an annotator (say, john) claims this document (opens and saves it in the task mode), another copy is created, with the file name story-1-john. This is the file on which the annotator will be working.
  • However, the task name (e.g. story-1) will apply to all the copies.
  • When another user (e.g. terry) has claimed the same task, another copy will be created (story-1-terry).
  • Now the adjudicators can use the annotation comparison facility to compare the two annotations.
  • The adjudicators can select one of the two annotations and make changes to it directly from the comparison facility.
  • When an adjudicator (say, chapman) saves the selected and modified version of one of the annotations, it gets saved with the original document name (story-1), i.e., the original document is overwritten.
  • An option worth considering is whether, instead of overwriting the original document, another copy should be created with the name of the adjudicator (story-1-chapman). But what if the adjudicator is also one of the annotators? That shouldn’t be a problem, because it can’t be true for the same document.
  • However, the copies of the work by the two annotators remain available for more work by the annotators or for any other use later, such as calculating inter-annotator agreement.
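To make the claiming scheme above concrete, here is a rough Python sketch of what claiming a task could look like on disk. The function name and workspace layout are assumptions for illustration; this is not Sanchay's actual code.

```python
import os

def claim_task(task_name, annotator, workspace):
    """Create the annotator's working copy of a task document.

    Mirrors the naming scheme described above (an assumption, not
    Sanchay's implementation): story-1 claimed by john becomes
    story-1-john, and the annotator works on that copy.
    """
    source = os.path.join(workspace, task_name)
    copy = os.path.join(workspace, f"{task_name}-{annotator}")
    with open(source, "r", encoding="utf-8") as src, \
         open(copy, "w", encoding="utf-8") as dst:
        dst.write(src.read())
    return copy
```

The original file (the task name) stays untouched until an adjudicator saves the selected annotation over it.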

That brings me to the facility I will try to add soon: calculating inter-annotator agreement for different levels of annotation.
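Until that facility is in place, agreement can be computed outside Sanchay. As an illustration, here is Cohen's kappa over two annotators' per-token label sequences; this is a generic, standard measure, and the planned facility may well use a different measure or data model.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators' label
    sequences (one label per token)."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    # Observed agreement: fraction of tokens labelled identically.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement under chance, from each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[l] * freq_b[l] for l in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1 - expected)
```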

Also, note that the task mode of working is not yet available for sentence alignment and word alignment interfaces. That is another item on the agenda.

9th April, 2010


Version 0.4.1 released on

There are the following major changes/additions in this release:

  1. Propbank Annotation is a separate application now, instead of being a mode of operation in the Syntactic Annotation Interface. The two-letter code is PB.
  2. Both the Propbank Annotation and the Syntactic Annotation interfaces can now work in two modes: file mode and task mode, which can be specified in the file Sanchay/props/client-modes.txt. You can change it according to your needs.
  3. In the task mode, the dormant but previously active facilities of annotation comparison and task generation are available. They are quite simple and should be easy to use.
  4. In the task mode, new facilities include the ability to specify two users and two adjudicators for every annotation task. In the beginning, all the tasks (which have been created) are available to everyone, but as soon as two users claim them (by opening and saving them), they become unavailable to others. Only the adjudicators can use the comparison (annotation diff) facility. Look for a file like Sanchay/workspace/syn-annotation/Premchand/Premchand-list.txt, which lists the tasks belonging to the task group named Premchand (one of the greatest Hindi writers). There is a task properties file for every task in the task group, which specifies the details used by the annotation interface. Generating tasks means specifying the files to be annotated and automatically generating the task properties files and the task list file.

If you intend to use the task mode of annotation, which is what I would recommend, then you should try to use the Task Generation facility. You can access it in the task mode, when you select the Setup mode (the other modes being the Work mode and the Compare mode). Once you go through a task list file, such as the one mentioned above, and the task properties files generated in the Setup mode, you will get a fairly good idea about how the task mode operates.

The assumption for the task mode right now is that Sanchay will be located on one computer (preferably Linux: I am not sure whether it can work on Windows for multiple users) and that computer will have accounts for every user who is going to be involved in the annotation process. To make sure that Sanchay is accessible to all these users (write permissions are needed for some properties files), one simple way is to create a group and give read and write permissions to that group for the whole Sanchay directory, except the files to which you want restricted access, e.g. the task list file and the client mode file.

2nd April, 2010


Update 1.2. Released only here. Includes the following changes:

  1. In the Propbank Annotation Interface (Syntactic Annotation in the Propbank mode), it is now possible to navigate by word stem and tags, for which files have to be provided. The default location of these files is Sanchay/workspace/syn-annotation. The file for word stems is named word-navigation-list.txt and the one for tags is named tag-navigation-list.txt. The navigation works on the word stem plus tag combination. For example, you can annotate all the main verbs (tag regex '^VM') with the stem कर (kara: do).
  2. In the Propbank mode, some of the dependency structure information will be hidden so that Propbank annotation can be performed without being biased by the dependency structure information and also to ensure that there is less chance of changing this information by mistake. You can still view the dependency tree in the tree visualizer and change it by drag and drop, but that will be fixed in one of the following updates.
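The stem-plus-tag navigation described in item 1 can be pictured as a filter over (stem, tag) pairs: a node qualifies when its stem is in the word list and its tag matches one of the tag regexes. The data structures here are hypothetical, not Sanchay's internals.

```python
import re

def navigation_targets(nodes, stems, tag_patterns):
    """Return the indices of nodes to navigate through, where a node
    is a (stem, tag) pair, stems is a set loaded from something like
    word-navigation-list.txt, and tag_patterns are regexes loaded
    from something like tag-navigation-list.txt."""
    compiled = [re.compile(p) for p in tag_patterns]
    return [i for i, (stem, tag) in enumerate(nodes)
            if stem in stems and any(r.search(tag) for r in compiled)]
```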

In the previous (1.1) update, some further changes were made in the tokenizer so that it doesn't split sentences at bullet points (in the case where the bullets are decimal numbers).
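The bullet-point behaviour can be illustrated with a naive splitter that protects decimal bullet numbers before breaking on sentence-final punctuation. This is just a sketch of the described behaviour, not the actual tokenizer.

```python
import re

def split_sentences(text):
    """Naive sentence splitter that does not break after decimal
    bullet numbers such as "1." at the start of a line."""
    # Protect "1." style bullets before splitting on sentence-final dots.
    protected = re.sub(r'(?m)^(\s*\d+)\.', r'\1<DOT>', text)
    parts = re.split(r'(?<=[.?!])\s+', protected)
    return [p.replace('<DOT>', '.').strip() for p in parts if p.strip()]
```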

29th March, 2010


A couple of bugs fixed in UTF-8 to WX (and vice-versa) conversion. The modified jar file has been uploaded on

28th March, 2010


Update 1 released for the Sanchay version 0.4.0 on

This is a minor update, so you can just download the jar file and replace the old one (in the Sanchay/dist directory) with this one.

There are the following changes/additions in this release:

1. In the Syntactic Annotation interface, when you open a raw text file (simple text without any annotation), the tokenizer that runs will now take into account the old Hindi/Sanskrit sentence/shloka/verse marker (e.g. ।। 1 ।।).
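The shloka-marker handling can be sketched as a splitter that keeps each ।। N ।। marker with its verse. This is an illustration of the described behaviour, not the actual tokenizer.

```python
import re

def split_verses(text):
    """Split text on the old Hindi/Sanskrit sentence/shloka/verse
    marker ।। N ।।, keeping each marker attached to its verse."""
    parts = re.split(r'(।।\s*\d+\s*।।)', text)
    verses, buf = [], ''
    for piece in parts:
        buf += piece
        if re.fullmatch(r'।।\s*\d+\s*।।', piece):
            verses.append(buf.strip())
            buf = ''
    if buf.strip():
        verses.append(buf.strip())
    return verses
```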

2. You can now convert the encoding of a syntactically annotated file, not just that of a simple text file. This facility has been connected to the Syntactic Annotation interface. (Click on the More button after opening a file).

18th March, 2010


I have released the binary for the 0.4.0 version of Sanchay on Sourceforge. I will release the source code later.

The file released is a zip file containing the complete Sanchay distribution. For the next few minor releases, I will only release the jar file (and source code, if possible). The jar file in the current zip file can then be replaced with the new jar file to update your copy of Sanchay from 0.4.0 to, say, 0.4.1. For very minor changes, I will put the jar file on

The releases should also be available from the previous location for latest builds.

16th March, 2010


A bug fixed, related to navigation from the result of query processing of multiple files. As I am not able to access the usual site where I put the latest versions, I am putting the jar file with the modification here:

Sanchay jar file only (16th March, 2010)

You should also be able to see the widget on the side panel on the right.

I am not able to put the complete zip file as the site account that I have allows files only up to 25 MB.

15th March, 2010


The Sanchay Home has now moved to Also, the Sourceforge UNIX name of Sanchay is now sanchay, instead of nlp-sanchay. Thus, the Sourceforge project page is now, instead of Due to this, you might find that some links no longer go where they went earlier, especially if you are coming from search engines (the Google page shows the old link). I have updated the links in many places (as have the Sourceforge people on their side), but there might still be problems in some places. If you start from the new Sanchay Home, though, most things should be accessible.

If you are one of the users of Sanchay, it would be helpful if you let me know and (even better) send some feedback: a list of bugs found, suggestions for new facilities, suggestions for modifications, etc.

News Till Now


Things have been happening on the Sanchay front and hopefully that will continue for a long time.

One of the problems (perhaps the biggest) is that there is no documentation. But I still can’t afford to seriously address that problem as I have to finish some urgent work. However, for some time now I have been posting updates about changes and additions to Sanchay here. Since I have got into the good habit of doing this, I will now post the updates using the medium I have become comfortable with.

As a result, I have set up this blog to post any new information about changes and additions to Sanchay. In this first post of the blog, I am going to post all the updates till now from the time I started posting them:

10th March, 2010: Couple of bugs fixed in query processing. Current directory will now be saved for the multiple files mode too.

26th Feb, 2010: Two more additions to the SCQL:

  • Return Values: On the RHS, you can specify return values by using the node symbols and the dot notation (e.g., C, N.A). For this purpose, another symbol S can be used to return sentences for the nodes which match. The syntax is intuitive and easy to remember: If you don’t provide an assignment value, then the node address is treated as the return value. One restriction at present is that only node addresses can be return values.
  • Destination: You can save the matched nodes or sentences in a file that you specify.

An example of extracting sentences which have the dependency relation ‘pof’ for the verb stem कर is given below:

A.a['drel']='pof' AND N.a['lex']='कर' -> S := ssf:qr.txt:UTF-8

Note that there is also a new 'destination' operator :=, which may be followed in the query by a specification of the destination in terms of the format (ssf; bf, or bracket format; pos, or POS tagged; raw, or simple text), the file path and the encoding, separated by colons. Optionally (e.g. when working from the GUI) you can leave the destination specification blank:

A.a['drel']='pof' AND N.a['lex']='कर' -> S :=

In this case you will be asked for the path of the file and the charset (encoding). The same will apply for querying multiple files (which right now works only from the interface). In the multiple files case, the path for the destination file will be asked first, followed by the paths of the source files.
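The destination specification is just three colon-separated fields. Here is a sketch of parsing it, assuming the last colon separates the encoding (so file paths containing colons would need extra care); the dictionary shape is an illustration, not Sanchay's API.

```python
def parse_destination(spec):
    """Split a destination specification of the form
    format:path:encoding (e.g. 'ssf:qr.txt:UTF-8').
    Returns None for a blank spec, in which case the GUI
    would prompt for the path and charset."""
    spec = spec.strip()
    if not spec:
        return None
    fmt, rest = spec.split(':', 1)
    path, encoding = rest.rsplit(':', 1)
    return {'format': fmt, 'path': path, 'encoding': encoding}
```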

You can provide multiple return values, but that will not work at present for the case when you also want to use the destination operator (it will work otherwise):

A.a['drel']='pof' AND N.a['lex']='कर' -> P and C and N

The above will return the current, the previous and the next node.

You will notice a couple of differences in the results table that is displayed. The navigation works in the same way.

14th Feb, 2010: One more wildcard added: . (the last one to match). A bug in the processing of wildcards and ranges fixed. Wildcards are now inclusive, whereas ranges are exclusive. You can use the M symbol on the LHS as well as the RHS. Also, you can use wildcards and ranges on the M nodes too. As an example, you can write the following query to transform the XC XC … NN kind of sequences (mentioned in the previous updates) to NNC NNC … NN:

P[*].t/p='XC' and C.t!='XC' -> M[p/*].t=C.t+'C'

A couple of points may be noted. First, the match alias (p) comes after the separator (/) when it is assigned to a condition, and it comes before the separator when it is used to access the matched nodes. Second, the meaning of the wildcard * is any for the condition nodes (A, D, N, P and T) and all for the matched nodes (M). Wildcards are not applicable to nodes which are necessarily single (C, R).
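The any/all distinction can be shown in a few lines. This is a toy model of the two wildcard meanings, not the SCQL implementation.

```python
def wildcard_match(nodes, predicate, wildcard):
    """Illustrate the wildcard semantics described above:
    '?' returns the first node satisfying the predicate;
    '*' on the condition side succeeds if ANY node satisfies it,
    while on the matched-node (M) side it selects ALL of them."""
    if wildcard == '?':
        return next((n for n in nodes if predicate(n)), None)
    if wildcard == '*':
        return [n for n in nodes if predicate(n)]
    raise ValueError(f"unknown wildcard: {wildcard}")
```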

Similarly, to mark predictable karta relations:

C.l='ने' AND C.f='t' AND A.N[?].t/q='VGF' -> A.a['drel']=''k1':M[q/?].a['name']'

I guess the basics of the language are now in place. There are some other things that I plan to do, but for some time I might shift to providing a way to use the language from the command line.

Language-Encoding Identifier: A shell file added to run the language-encoding identifier. To run it on your data, you will have to modify the shell file slightly by providing the path to a directory or to a file (whose language is to be identified) in place of the string ‘data/enc-lang-identifier/testing/Hindi-UTF8’. Language identification will be performed on all the files in the directory if you give a directory path. The output is simple: the file path and the guessed language-encoding. You can also change the other arguments if you know how to. If you don’t, you can ask me.
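The identification itself can be done with byte n-gram profiles, in the spirit of most language-encoding identifiers: build a profile per known language-encoding, then pick the profile closest to the input. This is a toy sketch, not Sanchay's identifier.

```python
from collections import Counter

def byte_ngram_profile(data, n=2):
    """Relative-frequency profile of byte n-grams in a bytes object."""
    counts = Counter(data[i:i + n] for i in range(len(data) - n + 1))
    total = sum(counts.values()) or 1
    return {g: c / total for g, c in counts.items()}

def closest_profile(sample, labelled_profiles):
    """Pick the label whose profile overlaps most with the sample's."""
    prof = byte_ngram_profile(sample)
    def score(ref):
        return sum(min(prof.get(g, 0.0), w) for g, w in ref.items())
    return max(labelled_profiles, key=lambda lbl: score(labelled_profiles[lbl]))
```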

13th Feb, 2010: Some major additions:

First, support for some wildcards added. At present, there are two wildcards: ? (first one to match) and * (any/all). I forgot to add another obvious one, but I will do that soon.

Second, support for ranges added. Ranges can be specified using a hyphen within the square brackets, e.g. N[2-4], P[-4], A[2-], etc.

Third, you can now access the matched nodes (other than the current node C) on the RHS through a new node type M. If there is only one condition, you can access the matched nodes without assigning an alias (the tentative separator for providing an alias is /), just through the symbol M, e.g. M.t='NN'. Otherwise, you can provide an alias for every condition and access the matched nodes through that alias. For example:

C.t~'^V' and N.t/p='VAUX' and A.a['num']/q='pl' -> M[p].t='AuxV' and M[q].t='VPP'.

Note that since aliases work like indices, not names, they are written without quotes. Names and values may have to be evaluated (as they might contain variables added through the device of concatenation), but not aliases.

See a more practical example below.

You can use the matched node symbol M in the same way as other node symbols, i.e., you can use the dot notation to get to the other nodes in the context of the matched nodes on the RHS (e.g. M.N[2].t). In principle, you should even be able to do this on the LHS too, though I haven’t tried that yet.

Fourth, you can search and navigate even in the multiple file mode. When you click on some match, the respective file will be opened in a new tab and you will be taken to the corresponding sentence. The matched nodes (‘current nodes’ for the query) will be highlighted. I will provide a facility to read queries from a file and run them on multiple annotated files, which will be useful for purposes such as performing a quick sanity check on the annotated data.

Some bugs fixed, so a couple of things that were not working earlier should work now (such as the XC XC NC example in the last update).

An example to use wildcards to mark the predictable karta (agent) relations:

C.l='ने' AND C.f='t' AND A.N[?].t/q='VGF' -> A.a['drel']=''k1':M[q].a['name']'

There might be some change in the way wildcards and ranges work. At present both are inclusive. But it will be better to have ranges working in the exclusive mode (either all match or none) and wildcards working in the inclusive mode (even if some don’t match, others can).

7th Feb, 2010: Concatenation operator (+) added. One example of its use is for converting sequences like XC NN or XC NNP to NNC NN and NNPC NNP, respectively:

C.t='XC' and P.t!='XC' and N.t!='XC' and C.f='t' -> C.t=N.t+'C'

For a sequence of length three (e.g. XC XC NN), at present you will have to write this:

C.t='XC' and P.t!='XC' and N.t='XC' and C.f='t' -> C.t=N[2].t+'C' and N.t=N[2].t+'C'

Once support for wildcards has been added, this should become simpler.

Two commands have also been added that can be used to reallocate the node ids and to reallocate unique names:

  • ReallocateIDs
  • ReallocateNames

The commands are processed in the same way as the queries. The second command above can be used to generate names before you start marking the dependency relations. A lot more commands will be added later.
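As an illustration of what ReallocateNames has to guarantee, here is a sketch that assigns a unique name per node based on its tag. The tag, tag2, tag3... naming scheme is an assumption for illustration, not necessarily the scheme Sanchay uses.

```python
from collections import defaultdict

def reallocate_names(tags):
    """Give every node (represented here just by its tag) a unique
    name, so that attributes like drel='k1:NP2' can refer to nodes
    unambiguously before dependency relations are marked."""
    seen = defaultdict(int)
    names = []
    for tag in tags:
        seen[tag] += 1
        names.append(tag if seen[tag] == 1 else f"{tag}{seen[tag]}")
    return names
```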

Another change is that the queries are now case insensitive, except for regular expressions.

The query language is mostly independent of the annotation scheme, i.e., it is not restricted to the scheme based on the Paninian Grammar. The tags and attribute names used in the examples, however, are for this scheme because most of the people that I know who are using Sanchay for annotation are working with this scheme. Both Sanchay and the query language can be easily adapted for other annotation schemes. Moreover, we will soon be shifting to a purely XML based format. Nor are they restricted to any specific languages. Some tools in Sanchay only work for some languages, but they can be made to work for other languages if the required data, rules etc. are available.

6th Feb, 2010: An example for marking dependency relations that are very predictable is as follows. Suppose (for a Hindi corpus) you want to mark every NP chunk that contains one of the words (post-positions or vibhaktis) का, के, की and is followed by another NP with a genitive relation (r6) to the next NP. Then you can write this transformative query:

C.l~'^का$|^के$|^की$' AND C.f='t' AND A.N.t='NP' -> A.a['drel']=''r6':A.N.a['name']'

This will work only if unique names have been generated before this query is applied. I will write in the next update about how that can be done with the same query mechanism.

Be careful about the use of quotes to differentiate the literal part of the value ('r6') from the part that has to be evaluated (A.N.a['name']). Also note that there is no space before or after the colon (:).

You can also notice one major change in the notation: square brackets are used now for indices (whether integers or keys) and parentheses are used instead for nested conditions as mentioned in the last update.

5th Feb, 2010: Some more extensions to the SCQL. It is now possible to give queries using nested conditions, combining AND and OR operators. Also, two essential operators added: not equal (!=) and not like (!~). The corrected query (from the last update) for tag validation would be:

C.t!~'^NN$|^JJ$|^VM$|^PSP$' AND C.f='t'

An example of a nested query is:

((C.t~'^N' OR C.t!~'^V') AND C.f='t') OR (C.l~'है' AND C.t~'V')

Queries on multiple files should also work. Note that currently the transformations happen in-place and no backup file is created. That should change soon. Also, at present there is no nesting on the RHS (as the meaning of nesting on the LHS and RHS is different), but I am working on that.
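The nested AND/OR combination can be modelled as a small recursive evaluator. This is a toy model, with conditions as Python predicates rather than parsed SCQL.

```python
def eval_condition(expr, node):
    """Evaluate a nested AND/OR condition tree against a node.
    expr is either a predicate (callable taking the node) or a
    tuple ('AND'|'OR', [subexpressions])."""
    if callable(expr):
        return expr(node)
    op, subs = expr
    results = (eval_condition(s, node) for s in subs)
    return all(results) if op == 'AND' else any(results)
```

For instance, the nested query above, ((C.t~'^N' OR C.t!~'^V') AND C.f='t') OR (C.l~'है' AND C.t~'V'), maps onto one such tree with four leaf predicates.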

1st Feb, 2010: Facility to query documents in the Syntactic Annotation interface extended. (The query language has been christened, quite unimaginatively, as the Sanchay Corpus Query Language or SCQL.) Initial support for transformations added. Two other kinds of nodes added: R (referred node) and T (referring node). Since there can be more than one referring node, they can be accessed through indices, e.g. T(1) or T(2). The referred node can be accessed by providing the name of the attribute through which it is referred, e.g. R('drel') would give the node that is referred to by the current node with the attribute 'drel' (as in drel='k1:NP1'). You can also check whether a node is a leaf node by writing something like C.f='t'. The two possible values of .f are 't' (true) and 'f' (false). You can query the level of a node in the tree by writing something like C.v='2'. Another important addition is that if you use the symbol ~ instead of =, the values will be treated as regular expressions.

Thus, to replace one tag with another, you can write:

C.t~'^NST$' -> C.t='STNoun'

Multiple transformations can be performed by using the AND operator on the RHS (Right Hand Side), including on nodes other than the current node by using the same notation as for the LHS. But it would be advisable to try first on some sample data as this facility has not yet been tested by others and I have only tried some simple things.

Queries can be written to perform a sanity check after manual or automatic annotation. More details about that later, but as an example, you can check for invalid POS tags by writing a query like:

C.t~'^NN|JJ|VM|PSP$' AND C.f='t' (Not correct: see the next update)

…assuming that there are only four POS tags as listed above. Attribute values and chunk tags can also be validated in a similar way.

Yet another addition is that you can first convert the (chunk) tree to dependency tree and perform the query on that by prefixing the query with DS: as below:

DS: C.v='0' AND D(2).t~'.*'

For a document this will return all the sentences (root nodes) where only partial dependency annotation has been performed (‘hanging nodes’). Note that this won’t return sentences where dependency annotation has not been started at all. For this latter case, you can use this query:

DS: C.v='0'

This will return all nodes on which some or all dependency annotation has been performed. If any sentence is missing in this list, annotation on that sentence has not been started at all. (But you can write a more direct query by using the != operator, as mentioned in the next update.)

To check for nodes outside any chunks:

C.v='1' AND C.f='t'

And to check for the presence of any nested chunks:

C.v='2' AND C.f='f'

27th Jan, 2010: Facility to query documents added to the Syntactic Annotation interface. For example, to find all nodes with the lexical data as लिए such that the previous node has the tag PSP you can write this query:

C.l='लिए' AND P.t='PSP'

The results returned will include the matched node, its parent node and the node referred to by it (where applicable) for dependency relations, e.g. if you search for a('drel')='pof', you will also get the node to which this 'pof' relation points.

Similarly to find nodes with the lexical data as ‘के’ and the parent (usually the chunk) having the tag NP, the query can be:

C.l='के' AND A.t='NP'

The following keys should be helpful till I add more information: C (Current node), P (Previous node), N (Next node), A (Ancestor node), D (Descendent node), l (lexical data or the word), t (tag) and a (attribute).

A query involving an attribute is:

C.l='के' AND A.a('cat')='n'

Note that the attribute names (like 'cat' above) and literal values (like 'के') must be enclosed in quotes. Values not enclosed in quotes must be node addresses using the above notation. Some limitations of this new implementation are: no nested queries, no wildcards or regular expressions and no ranges. Some of these will be removed soon. But you can use something like A(2) (grandparent node) or D(1) (the first child) or N(2) (the node next to the next node). A(1) means the same as A, and similarly for N and P.

25th Jan, 2010: The statistics facility in the Syntactic Annotation interface extended. The following statistics about the document being annotated are available now:

  • Number of paragraphs (if applicable)
  • Number of sentences
  • Number of words
  • Number of characters
  • Number of chunks
  • Number of POS tags
  • Number of chunk tags
  • Number of attributes
  • Number of attribute values
  • Number of attributes value pairs
  • Number of untagged words
  • Words and their frequencies
  • POS tags and their frequencies
  • Chunk tags and their frequencies
  • Word-tag pairs and their frequencies
  • Unchunked words and their frequencies
  • Lists of words for each tag
  • Lists of tags for each word
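Most of these counts reduce to simple frequency tables. Here is a minimal sketch over a flat list of (word, POS-tag) pairs; the input shape is an assumption for illustration, not the SSF document model the interface actually works on.

```python
from collections import Counter

def annotation_stats(tokens):
    """Compute a few of the statistics listed above from (word, tag)
    pairs, where an untagged word carries an empty tag."""
    words = Counter(w for w, _ in tokens)
    tags = Counter(t for _, t in tokens if t)
    pairs = Counter((w, t) for w, t in tokens if t)
    return {
        'num_words': sum(words.values()),
        'num_untagged': sum(1 for _, t in tokens if not t),
        'word_freq': words,
        'tag_freq': tags,
        'word_tag_freq': pairs,
    }
```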

18th Jan, 2010: In the Syntactic Annotation interface, you can now navigate to a sentence by clicking on the table showing the search results.

17th Jan, 2010: A shell script to extract sentences based on surface similarity (for Hindi) added. Another script to run a Statistical Machine Translation (SMT) based transliteration tool added (for English-Hindi and Hindi-English). If you run the scripts without any arguments, you will see a usage description. Some bugs fixed.

7th Jan, 2010: Many internal changes which are not yet reflected in the GUI. Apart from these, the Sentence Aligner interface has been connected with an automatic sentence aligner that should work better for the English-Hindi pair than for other pairs. The main window now has the list of available input methods in a combo box, which should make it easier to switch between input methods (for typing in different languages). I will soon add the facility to set shortcuts for selecting input methods. A couple of other bugs fixed. In the menu you will see commands for connecting to a remote computer, but they are not working yet.

11th Dec, 2009: A new tool added for creating linguistic trees based on the X-bar theory (but it can be used for any kind of phrase structure trees and maybe even other kinds of trees). There are shortcuts to add binary and ternary subtrees, triangles, adjuncts etc. Features can be added to nodes by right clicking and selecting. The tags, features and their values can be easily customized. The trees created can be stored in the SSF format as a text file and can also be exported as jpg or eps images. Terminal nodes can be edited to add text by clicking on them and typing. Most of the tree creation can be done just by using the keyboard (if you feel comfortable with that; otherwise you can use the mouse). To edit the text in a terminal node, move to that node using the keyboard (or the mouse), press spacebar (or double click) and type. Pressing Enter will complete node text editing. If the text is not displayed completely, just click on the + (Zoom In) or - (Zoom Out) button once.

9th Dec, 2009: The first working version of the Sentence Alignment interface completed. The input and output formats are the same as for the Word Alignment interface. The alignment positions are synchronized and no crossing edges are allowed. Many-to-one alignment is possible as long as there is no crossing.
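The no-crossing constraint is easy to state over (source, target) index pairs: once the links are sorted by source position, no later link may point back before an earlier target. Here is a sketch of such a check; the pair representation is an assumption, not the interface's internal format.

```python
def has_crossing(alignments):
    """Return True if the (source_index, target_index) alignment
    links contain crossing edges. Many-to-one links are allowed
    as long as the order is preserved."""
    edges = sorted(alignments)
    max_t = -1
    prev_s = None
    for s, t in edges:
        # A later source pointing to an earlier target crosses
        # some previous edge.
        if prev_s is not None and s > prev_s and t < max_t:
            return True
        max_t = max(max_t, t)
        prev_s = s
    return False
```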

7th Dec, 2009: Facility to edit and save shortcuts added (currently available only for the Frameset editing mode). A bug that was introduced after the last changes has been fixed. (The bug was related to displaying the dependency trees in the Syntactic Annotation interface.)

4th Dec, 2009: The facilities to delete individual alignments and all alignments in a sentence pair added. Also added the facility to reset the alignments of a sentence pair to the previously saved values. To delete an individual alignment, just press the Ctrl key and redraw (drag-and-drop) the alignment. Another addition is that you can load data that is in GIZA++ format.
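GIZA++'s alignment lines look like NULL ({ }) word1 ({ 1 2 }) word2 ({ 3 }), with 1-based target positions inside the braces. Here is a minimal reader for that line format; it covers only this one line shape, not the full three-line-per-pair output file.

```python
import re

def parse_giza_line(line):
    """Parse one GIZA++ alignment line into (source_words, links),
    where links are 1-based (source_index, target_index) pairs and
    the NULL token's (unaligned) targets are skipped."""
    pairs = re.findall(r'(\S+)\s+\(\{([\d\s]*)\}\)', line)
    words, links = [], []
    for word, targets in pairs:
        if word == 'NULL':
            continue
        words.append(word)
        for t in targets.split():
            links.append((len(words), int(t)))
    return words, links
```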

3rd Dec, 2009: The first working version of the Word (and Phrase) Alignment interface completed. Unlike the Parallel Corpus Markup interface, this one allows alignment by drag-and-drop. It is possible to group together words in the source and the target language sentences, tag them and align them. The layout is horizontal, unlike the Syntactic Annotation interface. Alignment can be many-to-many. The input file can be simple text (one sentence per line) or text file in SSF format. The saved files are currently in SSF format. The alignment information is stored in the SSF files through an additional attribute ‘alignedTo’. You can also try the initial version of the Sentence Alignment interface, but it won’t yet save the alignments.

24th Nov, 2009: A problem introduced by the recent (15th Nov) changes corrected. The Sentence and Word Alignment interfaces are being redesigned to allow easier and faster manual alignment. They are not yet complete, but should be soon.

16th Nov, 2009: One bug fixed in syntactic annotation replace facility. The data for the Language and Encoding Identifier recompiled so as to work with the current version.

15th Nov, 2009: There are a few more components in this version, one of them being a fully functional charmap cum font viewer and another being a (manual) sentence alignment interface. Some bugs in the Frameset mode of syntactic annotation have also been fixed. Also, this version brings a new mode of operating Sanchay, i.e., through the Sanchay Shell. A simple toolbar has been added to quickly start applications. In the Word List Visualizer tool, you can now use Surface Similarity to search for similar words (for now only for Indian languages, but to be extended to other languages). Future versions are likely to focus more on the shell mode of operation, rather than GUI based applications, although those will also continue to be added. But since a lot of new things have been added, there may be some things which don't work properly (as usual).

6th Oct, 2009: There is one more component integrated with the Sanchay GUI now: a verb frameset editor compatible with Cornerstone. Also, the Syntactic Annotation interface can now be run in frameset annotation mode (look for a checkbox below the tree area). The frameset editor is connected with the annotation interface (somewhat like Jubilee) such that new frameset files can be created while annotating if there is no existing file. The find and replace facility on the annotation interface has been improved too, though there might be still some issues. I will try to add keyboard shortcuts for more things soon so that annotation can be easier and faster. And, of course, documentation…