Sanchay News

संचय समाचार

Posts Tagged ‘Tokenization’

2nd April, 2010


Update 1.2. Released only here. Includes the following changes:

  1. In the Propbank Annotation Interface (Syntactic Annotation in the Propbank mode), it is now possible to navigate by word stem and tag, for which list files have to be provided. The default location of these files is Sanchay/workspace/syn-annotation. The file for word stems is named word-navigation-list.txt and the one for tags is named tag-navigation-list.txt. Navigation works on the combination of word stem and tag. For example, you can annotate all the main verbs (tag regex “^VM”) with the stem कर (kara: do).
  2. In the Propbank mode, some of the dependency structure information is now hidden, so that Propbank annotation can be performed without being biased by it and with less chance of changing it by mistake. You can still view the dependency tree in the tree visualizer and change it by drag and drop, but that will be fixed in one of the following updates.
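
To illustrate the stem plus tag navigation described in item 1, here is a minimal Python sketch. It is not Sanchay's code (Sanchay itself ships as a jar), and the names Token, compile_tag_patterns and matching_positions are made up for this example. It also assumes that each navigation list holds one entry per line, which is not stated above.

```python
import re
from typing import Iterable, List, NamedTuple, Pattern


class Token(NamedTuple):
    """Hypothetical token record: surface form, stem and tag."""
    word: str
    stem: str
    tag: str


def compile_tag_patterns(tag_lines: Iterable[str]) -> List[Pattern[str]]:
    # Assumption: one tag regex per line, blank lines ignored.
    return [re.compile(line.strip()) for line in tag_lines if line.strip()]


def matching_positions(tokens: List[Token],
                       stem_lines: Iterable[str],
                       tag_patterns: List[Pattern[str]]) -> List[int]:
    """Return indices of tokens whose stem is listed AND whose tag matches a regex."""
    # Assumption: one stem per line, blank lines ignored.
    stems = {line.strip() for line in stem_lines if line.strip()}
    return [
        i for i, tok in enumerate(tokens)
        if tok.stem in stems and any(p.search(tok.tag) for p in tag_patterns)
    ]


if __name__ == "__main__":
    sentence = [
        Token("राम", "राम", "NNP"),
        Token("काम", "काम", "NN"),
        Token("करता", "कर", "VM"),
        Token("है", "है", "VAUX"),
    ]
    patterns = compile_tag_patterns(["^VM"])
    print(matching_positions(sentence, ["कर"], patterns))
```

Because the functions accept any iterable of lines, an opened word-navigation-list.txt or tag-navigation-list.txt could be passed in directly, provided the files really follow the one-entry-per-line layout assumed here.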

In the previous (1.1) update, some further changes were made to the tokenizer so that it does not split a sentence at bullet points when the bullets are decimal numbers.
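
The bullet handling can be sketched roughly as follows. This is only an illustration in Python, not the actual Sanchay tokenizer: the function name split_sentences and both regular expressions are assumptions, and a decimal bullet is recognized only when it is the first token of a sentence.

```python
import re
from typing import List

# A decimal bullet such as "1." or "2.3." (assumed shape of the bullets).
BULLET = re.compile(r"\d+(?:\.\d+)*\.")
# Characters treated as sentence-final in this sketch.
SENTENCE_END = re.compile(r"[.?!।]$")


def split_sentences(text: str) -> List[str]:
    """Split on sentence-final punctuation, but never right after a leading bullet."""
    sentences: List[str] = []
    current: List[str] = []
    for token in text.split():
        current.append(token)
        ends_sentence = SENTENCE_END.search(token) is not None
        # A bullet only counts when it opens the sentence being built.
        is_bullet = len(current) == 1 and BULLET.fullmatch(token) is not None
        if ends_sentence and not is_bullet:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences


if __name__ == "__main__":
    for sentence in split_sentences("1. Open the file. 2. Save it."):
        print(sentence)
```

A number that ends a sentence (as in "He was born in 2010.") still closes it, because it is not the first token of that sentence.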

28th March, 2010


Update 1 released for the Sanchay version 0.4.0 on

This is a minor update, so you can just download the jar file and replace the old one (in the Sanchay/dist directory) with this one.

There are the following changes/additions in this release:

1. In the Syntactic Annotation interface, when you open a raw text file (simple text without any annotation), the tokenizer that runs will now take into account the old Hindi/Sanskrit sentence/shloka/verse marker (e.g. ।। 1 ।।).

2. You can now convert the encoding of a syntactically annotated file, not just that of a simple text file. This facility has been connected to the Syntactic Annotation interface. (Click on the More button after opening a file.)
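
For the verse marker mentioned in item 1, a rough Python sketch of how such a marker could be used as a boundary is given below. It is an illustration only, not Sanchay's tokenizer. The name split_verses and the exact pattern are assumptions: the marker is taken to be a number (ASCII or Devanagari digits) between two double dandas, written either as ॥ or as two single dandas.

```python
import re
from typing import List

# Assumed marker shape: double danda, verse number, double danda, e.g. "।। 1 ।।".
VERSE_MARKER = re.compile(r"(?:॥|।।)\s*[0-9०-९]+\s*(?:॥|।।)")


def split_verses(text: str) -> List[str]:
    """Cut the text after each verse marker, keeping the marker with its verse."""
    verses: List[str] = []
    start = 0
    for match in VERSE_MARKER.finditer(text):
        verses.append(text[start:match.end()].strip())
        start = match.end()
    # Any trailing text without a marker is kept as a final unit.
    rest = text[start:].strip()
    if rest:
        verses.append(rest)
    return verses


if __name__ == "__main__":
    for verse in split_verses("पहला पद ।। 1 ।। दूसरा पद ।। 2 ।।"):
        print(verse)
```

A single danda inside a verse does not match the pattern, so half-verse breaks are left alone in this sketch.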