lt-reweight: assign weights to a compiled transducer based on a corpus #24

Open · wants to merge 3 commits into base: main
7 changes: 5 additions & 2 deletions README
@@ -16,7 +16,7 @@ Executables built by this package:

* `lt-comp`: compiler, execute without parameters to show usage
instructions.

* `lt-proc`: processor, typical options are -a (lexical analyser,
default option), -g (lexical generator) and -p (lexical
post-generator). Using -h will show all flags.
@@ -31,6 +31,9 @@ Executables built by this package:

* `lt-print`: print the arcs of a transducer in [ATT format][3].

* `lt-reweight`: (experimental) utility to assign weights to a
compiled transducer based on a corpus.
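
  The PR text does not spell out how `lt-reweight` derives its weights, so
  the snippet below is only a generic sketch of the usual corpus-based
  scheme (relative frequencies turned into negative log probabilities, the
  common convention for weighted transducer arcs); the word forms and
  counts are invented and this is not the tool's actual code.

  ```cpp
  // Sketch only: corpus counts -> negative log probabilities, so more
  // frequent forms get lower (better) weights. Invented example data.
  #include <cmath>
  #include <iostream>
  #include <map>
  #include <string>

  int main()
  {
    // Hypothetical counts of surface forms observed in a corpus.
    std::map<std::string, long> counts = {
      {"casa", 120}, {"casas", 30}, {"casar", 10}
    };

    long total = 0;
    for (const auto &kv : counts)
    {
      total += kv.second;
    }

    // weight = -log(count / total); values like these would then be
    // attached to the corresponding paths of the compiled transducer.
    for (const auto &kv : counts)
    {
      double w = -std::log(static_cast<double>(kv.second) / total);
      std::cout << kv.first << "\t" << w << "\n";
    }
    return 0;
  }
  ```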

There is also a C++ API that you can link to (see how [apertium][1] or
[apertium-lex-tools][2] do this).
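
A minimal sketch of what such linking can look like, assuming the
FSTProcessor API of this era as used by lt-proc (FILE*-based load,
initAnalysis, valid and analysis); treat the header path and exact
signatures as assumptions rather than documentation. The compile and
link flags would come from pkg-config's lttoolbox module (the exact
module name varies by version).

```cpp
// Minimal sketch: load a compiled transducer and analyse stdin,
// mirroring what lt-proc -a does. Header path and signatures assumed.
#include <cstdio>
#include <lttoolbox/fst_processor.h>

int main(int argc, char *argv[])
{
  if (argc != 2)
  {
    std::fprintf(stderr, "usage: %s analyser.bin\n", argv[0]);
    return 1;
  }

  FILE *in = std::fopen(argv[1], "rb");
  if (in == NULL)
  {
    std::fprintf(stderr, "error: cannot open %s\n", argv[1]);
    return 1;
  }

  FSTProcessor fstp;
  fstp.load(in);                // read the compiled transducer
  std::fclose(in);

  fstp.initAnalysis();          // same set-up lt-proc -a performs
  if (!fstp.valid())
  {
    return 1;
  }
  fstp.analysis(stdin, stdout); // analyse the Apertium stream on stdin
  return 0;
}
```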

@@ -54,4 +57,4 @@ Building & installing:

[1]: https://github.com/apertium/apertium
[2]: https://github.com/apertium/apertium-lex-tools
[3]: http://wiki.apertium.org/wiki/ATT_format
[3]: http://wiki.apertium.org/wiki/ATT_format
8 changes: 6 additions & 2 deletions lttoolbox/Makefile.am
@@ -14,7 +14,7 @@ cc_sources = alphabet.cc att_compiler.cc compiler.cc compression.cc entry_token.
library_includedir = $(includedir)/$(PACKAGE_NAME)-$(VERSION_API)/$(PACKAGE_NAME)
library_include_HEADERS = $(h_sources)

bin_PROGRAMS = lt-comp lt-proc lt-expand lt-tmxcomp lt-tmxproc lt-print lt-trim
bin_PROGRAMS = lt-comp lt-proc lt-expand lt-tmxcomp lt-tmxproc lt-print lt-trim lt-reweight
instdir = lttoolbox

lib_LTLIBRARIES= liblttoolbox3.la
@@ -55,6 +55,10 @@ lt_tmxproc_SOURCES = lt_tmxproc.cc
lt_tmxproc_LDADD = liblttoolbox$(VERSION_MAJOR).la
lt_tmxproc_LDFLAGS = -llttoolbox$(VERSION_MAJOR) $(LTTOOLBOX_LIBS)

lt_reweight_SOURCES = lt_reweight.cc
lt_reweight_LDADD = liblttoolbox$(VERSION_MAJOR).la
lt_reweight_LDFLAGS = -llttoolbox$(VERSION_MAJOR) $(LTTOOLBOX_LIBS)

#lt-validate-dictionary: Makefile.am validate-header.sh
# @echo "Creating lt-validate-dictionary script"
# @echo "#!$(BASH)" > $@
@@ -65,7 +69,7 @@ lt_tmxproc_LDFLAGS = -llttoolbox$(VERSION_MAJOR) $(LTTOOLBOX_LIBS)



man_MANS = lt-comp.1 lt-expand.1 lt-proc.1 lt-tmxcomp.1 lt-tmxproc.1 lt-print.1 lt-trim.1
man_MANS = lt-comp.1 lt-expand.1 lt-proc.1 lt-tmxcomp.1 lt-tmxproc.1 lt-print.1 lt-trim.1 lt-reweight.1

INCLUDES = -I$(top_srcdir) $(LTTOOLBOX_CFLAGS)
if WINDOWS
66 changes: 66 additions & 0 deletions lttoolbox/lt-reweight.1
@@ -0,0 +1,66 @@
.TH lt-reweight 1 2014-02-07 "" ""
.SH NAME
lt-reweight \- This application is part of the lexical processing modules
and tools (
.B lttoolbox
)
.PP
This tool is part of the apertium machine translation
architecture: \fBhttp://www.apertium.org\fR.
.SH SYNOPSIS
.B lt-trim
analyser_binary bidix_binary trimmed_analyser_binary
.PP
.SH DESCRIPTION
.BR lt-trim
is the application responsible for trimming compiled dictionaries. The
analyses (right-side when compiling lr) of analyser_binary are trimmed
to the input side of bidix_binary (left-side when compiling lr,
right-side when compiling rl), such that only analyses which would
pass through `lt-proc \-b bidix_binary' are kept.
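
As a rough illustration of the paragraph above (a toy string-level
stand-in, not the actual transducer intersection lt-trim performs): an
analysis survives only if some bidix input-side entry, i.e. lemma plus
its leading tags, is a prefix of it. The analyses and entries below are
invented.

```cpp
// Toy illustration only: prefix check standing in for the real
// transducer-level trimming. Invented example data.
#include <iostream>
#include <string>
#include <vector>

int main()
{
  std::vector<std::string> analyses = {
    "house<n><sg>", "house<vblex><inf>", "housing<n><sg>"
  };
  // Input side of the bidix: lemma plus leading tags.
  std::vector<std::string> bidixInput = {"house<n>", "housing<n>"};

  for (const std::string &a : analyses)
  {
    bool kept = false;
    for (const std::string &b : bidixInput)
    {
      // An analysis survives if some bidix entry is a prefix of it.
      if (a.compare(0, b.size(), b) == 0)
      {
        kept = true;
        break;
      }
    }
    std::cout << a << (kept ? "\tkept" : "\ttrimmed") << "\n";
  }
  return 0;
}
```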

\fBWarning: this program is experimental!\fR It has been tested, but
not deployed extensively yet.

Compound tags (`<compound-only-L>', `<compound-R>'), join
elements (`<j/>' in XML, `+' in the stream) and the group element
(`<g/>' in XML, `#' in the stream) should all be handled correctly;
even combinations of `+' followed by `#' in the monodix are handled.

Some minor caveats: if you have the capitalised lemma "Foo" in the
monodix but "foo" in the bidix, an analysis "^Foo<tag>$" would pass
through the bidix when doing lt-proc \-b, but would not make it through
trimming. Make sure your lemmas have the same capitalisation in the
different dictionaries. Also, you should not have literal `+' or `#'
in your lemmas. Since lt-comp doesn't escape these, lt-trim cannot
know that they are different from `<j/>' or `<g/>', and you may get
@-marked output this way. You can analyse `+' or `#' by having the
literal symbol in the `<l>' part and some other string (e.g. "plus")
in the `<r>'.

You should not trim a generator unless you have a \fBvery\fR simple
translator pipeline, since the output of bidix seldom goes unchanged
through transfer.
.PP
.SH FILES
.B analyser_binary
The untrimmed analyser dictionary (a finite state transducer).
.PP
.B bidix_binary
The dictionary to use as trimmer (a finite state transducer).
.PP
.B trimmed_analyser_binary
The trimmed analyser dictionary (a finite state transducer).

.SH SEE ALSO
.I lt-comp\fR(1),
.I lt-proc\fR(1),
.I lt-print\fR(1),
.I lt-expand\fR(1),
.I lt-trim\fR(1),
.I apertium-tagger\fR(1),
.I apertium\fR(1).
.SH BUGS
Lots of...lurking in the dark and waiting for you!
.SH AUTHOR
(c) 2013--2014 Universitat d'Alacant / Universidad de Alicante.
18 changes: 11 additions & 7 deletions lttoolbox/lt_proc.cc
@@ -54,7 +54,8 @@ void endProgram(char *name)
cout << " -o, --surf-bilingual: lexical transfer with surface forms" << endl;
cout << " -p, --post-generation: post-generation" << endl;
cout << " -x, --inter-generation: inter-generation" << endl;
cout << " -s, --sao: SAO annotation system input processing" << endl;
// Deprecated:
// cout << " -s, --sao: SAO annotation system input processing" << endl;
cout << " -t, --transliteration: apply transliteration dictionary" << endl;
cout << " -v, --version: version" << endl;
cout << " -z, --null-flush: flush output on the null character " << endl;
@@ -79,7 +80,8 @@ void endProgram(char *name)
cout << " -o: lexical transfer with surface forms" << endl;
cout << " -p: post-generation" << endl;
cout << " -x: inter-generation" << endl;
cout << " -s: SAO annotation system input processing" << endl;
// Deprecated:
// cout << " -s: SAO annotation system input processing" << endl;
cout << " -t: apply transliteration dictionary" << endl;
cout << " -v: version" << endl;
cout << " -z: flush output on the null character " << endl;
@@ -124,7 +126,8 @@ int main(int argc, char *argv[])
{"tagged-nm-gen", 0, 0, 'm'},
{"post-generation", 0, 0, 'p'},
{"inter-generation", 0, 0, 'x'},
{"sao", 0, 0, 's'},
// Deprecated:
// {"sao", 0, 0, 's'},
{"transliteration", 0, 0, 't'},
{"null-flush", 0, 0, 'z'},
{"dictionary-case", 0, 0, 'w'},
@@ -143,9 +146,9 @@
{
#if HAVE_GETOPT_LONG
int option_index;
int c = getopt_long(argc, argv, "abcegi:r:lmndopxstzwvCIWN:L:h", long_options, &option_index);
int c = getopt_long(argc, argv, "abcegi:r:lmndopxtzwvCIWN:L:h", long_options, &option_index);
#else
int c = getopt(argc, argv, "abcegi:r:lmndopxstzwvCIWN:L:h");
int c = getopt(argc, argv, "abcegi:r:lmndopxtzwvCIWN:L:h");
#endif

if(c == -1)
@@ -210,7 +213,6 @@ int main(int argc, char *argv[])
case 'p':
case 'x':
case 't':
case 's':
case 'C':
if(cmd == 0)
{
@@ -362,12 +364,14 @@ int main(int argc, char *argv[])
fstp.intergeneration(input, output);
break;

/** Deprecated:

case 's':
fstp.initAnalysis();
checkValidity(fstp);
fstp.SAO(input, output);
break;

*/
case 't':
fstp.initPostgeneration();
checkValidity(fstp);