<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<link rel="stylesheet"
href="css/jquery-ui.css"
type="text/css" />
<link rel="stylesheet" href="css/research.css" type="text/css" />
<script type="text/javascript" src="js/jquery-1.4.2.js"></script>
<script type="text/javascript" src="js/jquery.ui.core.js"></script>
<script type="text/javascript" src="js/jquery.ui.widget.js"></script>
<script type="text/javascript" src="js/jquery.ui.tabs.js"></script>
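<script type="text/javascript">
  $(function() {
    $("#tabs").tabs({
      select: function (e, ui) {
        // Put the selected tab's hash in the URL without adding a history entry.
        window.location.replace(ui.tab.hash);
        // Show a placeholder in a panel that has no content yet.
        var $panel = $(ui.panel);
        if ($panel.is(":empty")) {
          $panel.append("<div class='tab-loading'>Loading...</div>");
        }
      }
    });
  });
</script>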
<title>
CLAIR Research
</title>
</head>
<body>
<!-- It is unclear to me how a giant, unmaintainable JavaScript
object is preferable to nicely broken up HTML. This is more
maintainable, for one thing, and you have less obnoxious
line-break behaviour. Also, this version degrades gracefully
and shows people using w3m or lynx something other than a
blank page. To add a new tab to this section, add a new item
to the unordered list below, then create a new div with the
same id. Clone and hack should do you fine. -->
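<!-- A minimal sketch of the two pieces a new tab needs, following the
     note above. The id "newproject", the tab label, and the placeholder
     text are hypothetical; the class names are copied from the existing
     panels. Put the list item inside the unordered list and the div
     next to the other panels inside the tabs div.

     <li><a href="#newproject">New Project</a></li>

     <div id="newproject" class="colmask rightmenu">
     <div class="colleft">
     <div class="col1">
     <h1>New Project</h1>
     <p>Project description goes here.</p>
     </div>
     <div class="col2">
     <h2>People</h2>
     <ul class="people">
     <li>Name of a contributor</li>
     </ul>
     </div>
     </div>
     </div>
-->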
<div id="tabs">
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#clairlib">Clairlib</a></li>
<li><a href="#politext">PoliText</a></li>
<!--li><a href="#dynamicsalience">Dynamic Salience</a></li-->
<li><a href="#facets">Facets</a></li>
<li><a href="#ssknn">SSkNN</a></li>
<li><a href="#gin">GIN</a></li>
<li><a href="#gin-na">GIN-NA</a></li>
<li><a href="#gin-ie">GIN-IE</a></li>
<li><a href="#bioevents">BioEvents</a></li>
<li><a href="#biocontext">BioContext</a></li>
<li><a href="#speculation">Speculation</a></li>
<li><a href="#tumbl">Tumbl</a></li>
<li><a href="#lexrank">LexRank</a></li>
<li><a href="#mead">MEAD</a></li>
<li><a href="#blogocenter">BlogoCenter</a></li>
<li><a href="#aan">AAN</a></li>
<li><a href="#iopener">iOpener</a></li>
<li><a href="#nsir">NSIR</a></li>
<li><a href="#collectivediscourse">Collective Discourse</a></li>
<li><a href="#gbnlpir">Graph-Based NLP/IR</a></li>
<li><a href="#scil">SCIL</a></li>
<li><a href="#SoCS">SoCS</a></li>
<li><a href="#FUSE">FUSE</a></li>
</ul>
<div id="intro" class="colmask rightmenu">
<div class="colleft">
<div class="col1">
<h1>Introduction</h1>
<p>
The CLAIR (Computational Linguistics And Information
Retrieval) research group focuses on text analysis,
natural language processing, information retrieval, and
network analysis. Specific projects involve text
summarization, question answering, topic modeling, and
bibliometrics. Application areas include
bioinformatics, political science, social media
analysis, and others.
</p>
</div>
<div class="col2">
</div>
</div>
</div>
<div id="FUSE" class="colmask rightmenu">
<div class="colleft">
<div class="col1">
<h1>Foresight and Understanding from Scientific Exposition (FUSE)</h1>
<p>(from the IARPA FUSE page)
The FUSE Program seeks to develop automated methods that aid in the systematic, continuous, and comprehensive assessment of technical emergence using information found in the published scientific, technical, and patent literature. FUSE envisions a system that could (1) process the massive, multi-discipline, growing, noisy, and multilingual body of full-text scientific, technical, and patent literature from around the world; (2) automatically generate and prioritize RDGs, nominate those that exhibit technical emergence, and provide compelling evidence for that emergence; and (3) provide this capability for literatures in English and at least two non-English languages. The FUSE Program will also address the vital challenge of validating such a system, using real world data.</p>
<ul>
<li><a href="http://clair.si.umich.edu/fuse">Umich FUSE page</a></li>
<li><a href="http://clair.si.umich.edu/anthology">ACL Anthology Network</a></li>
<li><a href="http://www.iarpa.gov/solicitations_fuse.html">IARPA FUSE Page</a></li>
</ul>
<!--<h2>Papers</h2>
<ul class="links">
</ul>-->
</div><div class="col2">
<h2>People</h2>
<ul class="people">
<li>Aditya Tayade</li>
<li>Ben King</li>
<li>Paritosh Aggarwal</li>
<li>Rahul Jha</li>
<li>Wanchen Lu</li>
<li>Dragomir Radev</li>
</ul>
</div>
</div>
</div>
<div id="SoCS" class="colmask rightmenu">
<div class="colleft">
<div class="col1">
<h1>Assessing Information Credibility Without Authoritative Sources
</h1>
<p>(from Paul Resnick's page)
This project will develop tools that help people make personal
assessments of credibility. Rather than relying on particular sources
as authoritative arbiters of ground truth, the goal is to minimize the
amount of "social implausibility". That is, the tool will identify
assertions that are disbelieved by "similar" people (those who, after
careful consideration, someone tended to agree with in the past) or
come from sources that someone has tended to disagree with. A text
mining system for online media will be developed to extract
controversial assertions and the beliefs expressed by users about
those assertions. Comparisons of beliefs about common assertions, and
retractions or updates to beliefs, will be tracked as part of
personalized reputation measures.
(Joint work with Qiaozhu Mei, Rahul Sami, and Dragomir Radev. Funded
by NSF under Grant No. IIS-0968489.)</p>
<h2>Papers</h2>
<ul class="links">
<li><a href="http://clair.si.umich.edu/~radev/papers/EMNLP.pdf">
Vahed Qazvinian; Emily Rosengren; Dragomir R. Radev; and Qiaozhu Mei. "Rumor has it: Identifying Misinformation in Microblogs". <cite>Conference on Empirical Methods in Natural Language Processing (EMNLP 2011)</cite>.</a></li>
</ul>
</div><div class="col2">
<h2>People</h2>
<ul class="people">
<li>Vahed Qazvinian</li>
<li>Emily Rosengren</li>
<li>Qiaozhu Mei</li>
<li>Dragomir Radev</li>
<li>Paul Resnick</li>
<li>Rahul Sami</li>
<li>Pradeep Muthukrishnan</li>
</ul>
</div>
</div>
</div>
<div id="scil" class="colmask rightmenu">
<div class="colleft">
<div class="col1">
<h1>Sociolinguistics/SCIL</h1>
<p>
Mining sentiment from user-generated content is an important task in natural language processing.
One example of such content is threaded discussions, which serve as an important tool for communication
and collaboration on the Web. Threaded discussions include e-mails, e-mail lists, bulletin boards,
newsgroups, and Internet forums. Most work on sentiment analysis has centered on finding
the sentiment toward products or topics. The SCIL project aims to develop tools for predicting power,
influence, and rifts in social groups through linguistic analysis. The languages of focus are English, Arabic, and Urdu.
</p>
<h2>Demonstration</h2>
<ul class="links">
<li><a href="http://clair.eecs.umich.edu/SubgroupDetector/">Demo (Vector Clustering Approach)</a></li>
<li><a href="http://clair.eecs.umich.edu/subgroup_detector/index.php">Demo (Signed Network Partitioning Approach)</a></li>
</ul>
<h2>Papers</h2>
<ul class="links">
<li><a href="http://clair.si.umich.edu/~radev/papers/P11-2104.pdf">
Ahmed Hassan; Amjad Abu-Jbara; Rahul Jha; and Dragomir Radev. "Identifying the Semantic Orientation of Foreign Words".
<cite>The 49th Annual Meeting of the Association for Computational Linguistics (ACL 2011)</cite>.</a></li>
<li><a href="http://clair.si.umich.edu/~radev/papers/EMNLP121.pdf">
Ahmed Hassan; Vahed Qazvinian; and Dragomir Radev. "What's with the Attitude? A study of Participant Attitude in Multi-Party Online Discussions".
<cite>The 2010 Conference on Empirical Methods in Natural Language Processing (EMNLP 2010)</cite>.</a></li>
<li><a href="http://clair.si.umich.edu/~radev/papers/P10-1041.pdf">
Ahmed Hassan and Dragomir Radev. "Identifying Text Polarity Using Random Walks". <cite>The 48th Annual Meeting of the Association for Computational Linguistics (ACL 2010)</cite>.</a></li>
</ul>
</div>
<div class="col2">
<h2>People</h2>
<ul class="people">
<li>Dragomir Radev</li>
<li>Ahmed Hassan</li>
<li>Vahed Qazvinian</li>
</ul>
</div>
</div>
</div>
<div id="clairlib" class="colmask rightmenu">
<div class="colleft">
<div class="col1">
<h1>Clairlib: The Meta-Project</h1>
<img alt="" class="logo"
src="images/clairlib-330.png" />
<p>
Clairlib is a suite of open-source Perl modules
developed and maintained by the Computational
Linguistics And Information Retrieval (CLAIR) group at
the University of Michigan. Clairlib is intended to
simplify a number of generic tasks in natural language
processing (NLP), information retrieval (IR), and
network analysis (NA). The latest version of Clairlib is
1.08, which was released in September 2009 and includes about
150 modules implementing a wide range of
functionality.
</p>
<p>
Clairlib is distributed in two forms: Clairlib-core,
which has essential functionality and minimal dependence
on external software, and Clairlib-ext, which has
extended functionality that may be of interest to a
smaller audience. Much can be done using Clairlib on its
own. Its capabilities include
tokenization, summarization, document clustering,
document indexing, Web graph analysis, network
generation, power law distribution analysis, network
analysis, random walks on graphs, TF-IDF, perceptron
learning and classification, and phrase-based retrieval
and fuzzy OR queries.
</p>
<p>
Clairlib modules are available for download at
www.clairlib.org. Installation instructions and module
documentation are also available in both PDF and HTML
formats. Clairlib comes with many code examples and
a set of tutorials on using its modules in
various applications.
</p>
<p>This <a href="cl-demo.pdf">paper</a> describes Clairlib.</p>
<p>
This work has been supported in part by National
Institutes of Health grants R01 LM008106 "Representing
and Acquiring Knowledge of Genome Regulation" and U54
DA021519 "National Center for Integrative
Bioinformatics", as well as by grants IDM 0329043
"Probabilistic and Link-based Methods for Exploiting
Very Large Textual Repositories", DHB 0527513 "The
Dynamics of Political Representation and Political
Rhetoric", and 0534323 "Collaborative Research: BlogoCenter
- Infrastructure for Collecting, Mining and Accessing
Blogs", from the
National Science Foundation.
</p>
<h2>Links</h2>
<ul class="links">
<li><a href="http://clairlib.org">Project
Website</a></li>
</ul>
</div>
<div class="col2">
<h2>People</h2>
<ul class="people">
<li>Dragomir Radev</li>
<li>Mark Hodges</li>
<li>Anthony Fader</li>
<li>Mark Joseph</li>
<li>Joshua Gerrish</li>
<li>Mark Schaller</li>
<li>Jonathan dePetri</li>
<li>Bryan Gibson</li>
<li>Chen Huang</li>
<li>Amjad Abu-Jbara</li>
<li>Prem Ganeshkumar</li>
</ul>
</div>
</div>
</div>
<div id="politext" class="colmask rightmenu">
<div class="colleft">
<div class="col1">
<h1>Analyzing Political Speech</h1>
<!--img alt="" src="images/poli_sci-330.png" class="logo" /-->
<img alt="" src="images/polisci-small-330.png" class="logo" />
<p>
This project, representing one of the first major collaborations across linguistics and political science (as well as statistics, computer science, and information science), seeks to exploit the opportunities presented by emerging electronic records, both contemporary and historical, of legislative debates around the world. For political science, such records represent a uniquely detailed account, ranging across multiple time scales, of elite positions on political issues and their dynamics. For linguistics, such records present a unique account of the spoken word in a controlled setting, ranging across time scales from minutes to centuries. The massive scale of the databases involved presents statistical and computational challenges of interest and application in other fields. The project seeks to develop the methodological and computational infrastructure necessary to exploit these data for a unique interdisciplinary and multidisciplinary understanding of dynamics in human political and linguistic behavior. Tasks the project has addressed include topic modeling of legislative speech, identifying influential members of the US Senate, tracking how members' influence
varies over time, and modeling political attention.
</p>
<p>
One of the important tasks we addressed in this project is the study of influence and salience in political discussions.
We introduced MavenRank, a technique for identifying the most
salient participants in a discussion. The method
is based on lexical centrality: a random walk
is performed on a graph in which each node is a
participant in the discussion and an edge links two
participants who use similar rhetoric. As a test, we
used MavenRank to identify the most influential members
of the US Senate using data from the US Congressional
Record, and used committee ranking to evaluate the
output. Our results show that scores are
largely driven by committee status in most topics, but
can capture speaker centrality in topics where speeches
are used to indicate ideological position rather than to
influence legislation. We also introduced a technique for analyzing the
temporal evolution of the salience of participants in a discussion. This method uses
graph-based techniques to track dynamically how the relative importance of speakers
evolves over time. It is dynamic in the sense that the graph itself evolves
over time to capture changes in the participants' salience. We applied
it to members of the US Senate using data from the
US Congressional Record. Our results show that the scores capture speaker centrality
within topics as well as events that change the salience or influence of different participants.
</p>
<!--<h2>Links</h2>
<ul class="links">
<li><a
href="http://clair.si.umich.edu/clair/clair/poliscitopics.html">Topic
Identification</a></li>
<li><a
href="http://clair.si.umich.edu/clair/clair/poliscispeakers.html">Identifying
Central Speakers</a></li>
</ul> -->
<h2>Papers</h2>
<ul class="links">
<li><a
href="http://clair.si.umich.edu/~radev/papers/AJPS2010.pdf">Quinn,
Kevin; Monroe, Burt; Colaresi, Michael; Crespin,
Michael; Radev, Dragomir R. “How to Analyze Political
Attention with Minimal Assumptions and
Costs”. <cite>American Journal of Political
Science</cite>. 2010.</a></li>
<li><a
href="http://clair.si.umich.edu/~radev/papers/mpsa06.pdf">Quinn,
Kevin M.; Monroe, Burt L.; Colaresi, Michael; Crespin,
Michael H.; Radev, Dragomir R. “An Automated Method of
Topic-Coding Legislative Speech Over Time with
Application to the 105th–108th
U.S. Senate”. <cite>Midwest Political Science
Association Meeting</cite>. 2006.</a></li>
<li><a
href="http://clair.si.umich.edu/radev/papers/coling08a.pdf">
Hassan, Ahmed; Fader, Anthony; Crespin, Michael; Quinn,
Kevin; Monroe, Burt; Colaresi, Michael; Radev, Dragomir
R. “Tracking the dynamic evolution of participant
salience in a discussion”. <cite>COLING
2008</cite>. Manchester, UK. 2008.</a></li>
<li><a
href="http://clair.si.umich.edu/~radev/papers/emnlp07polisci.pdf">Fader,
Anthony; Radev, Dragomir R.; Crespin, Michael H.;
Monroe, Burt L.; Quinn, Kevin M.; Colaresi,
Michael. “MavenRank: Identifying Influential Members of
the US Senate Using Lexical Centrality”.
<cite>Proceedings of the Conference of Empirical Methods
in Natural Language Processing (EMNLP
'07)</cite>. Prague, Czech Republic. June
28–30. 2007.</a></li>
</ul>
</div>
<div class="col2">
<h2>People</h2>
<ul class="people">
<li>Dragomir Radev</li>
<li>Ahmed Hassan</li>
<li>Anthony Fader</li>
</ul>
</div>
</div>
</div>
<div id="facets" class="colmask rightmenu">
<div class="colleft">
<div class="col1">
<h1>Detecting Multiple Facets of an Event Using
Graph-Based Unsupervised Methods</h1>
<img alt="" src="images/vtech_topics-330.png"
class="logo" />
<p>
We propose two new unsupervised methods to extract
different facets of news events from blog
postings. Both methods are two-step processes: the
first step generates candidate facets using
Kullback-Leibler divergence, and the second step
selects a set of facets that covers a chosen
space of documents while maximizing the diversity of the
facets themselves.
</p>
<p>
The two algorithms differ in which documents they try to cover. The first algorithm attempts to pick facets that cover the entire space of documents. The second algorithm, by contrast, takes as input the number of topics the user requests. Each document is weighted so that broad facets are chosen when few topics are requested and narrow facets when many are requested.
</p>
<h2>Papers</h2>
<ul class="links">
<li><a
href="http://clair.si.umich.edu/~radev/papers/coling08b.pdf">Muthukrishnan,
Pradeep; Gerrish, Joshua; Radev, Dragomir
R. “Detecting Multiple Facets of an Event Using
Graph-Based Unsupervised Methods”. <cite>COLING
2008.</cite> Manchester, UK. 2008.</a></li>
<li>
Muthukrishnan, Pradeep; Radev, Dragomir R.
"Adaptive Detection Of Multiple Facets Of An Event Using Graph-Based Unsupervised Methods".
Submitted to <cite>Knowledge and Information Systems (KAIS)</cite>, 2011.
</li>
</ul>
</div>
<div class="col2">
<h2>People</h2>
<ul class="people">
<li>Pradeep Muthukrishnan</li>
<li>Joshua Gerrish</li>
<li>Dragomir Radev</li>
</ul>
</div>
</div>
</div>
<div id="ssknn" class="colmask rightmenu">
<div class="colleft">
<div class="col1">
<h1>Improved Nearest Neighbor Methods For Text
Classification With Language Modeling and Harmonic
Functions</h1>
<img alt="" src="images/ssknn-330.png"
class="logo" />
<p>
In this project, we present new nearest neighbor
methods for text classification and evaluate
them against existing nearest neighbor
methods as well as other well-known text classification
algorithms. Inspired by the language modeling approach
to information retrieval, we show improvements in
k-nearest neighbor (kNN) classification by replacing the
classical cosine similarity with a KL divergence based
similarity measure. We also present an extension of kNN
to the semi-supervised case, which turns out to be
equivalent to semi-supervised
learning with harmonic functions. In both supervised and
semi-supervised experiments, our algorithms surpass
state-of-the-art methods such as Support Vector Machines
(SVM) and transductive SVM on the Reuters Corpus Volume
I (RCV1) and the 20 Newsgroups dataset, and produce
competitive results on the Reuters-21578 dataset. To our
knowledge, this paper presents the most comprehensive
evaluation of different machine learning algorithms on
the entire RCV1 dataset.
</p>
<h2>Papers</h2>
<ul class="links">
<li><a
href="http://clair.si.umich.edu/~radev/papers/tc.pdf"> Gunes Erkan, Ahmed Hassan, and Dragomir Radev.
Improved Nearest Neighbor Methods For Text Classification With Language Modeling and Harmonic Functions.
<cite>Submitted to Computational Intelligence</cite>. 2011.</a></li>
</ul>
</div>
<div class="col2">
<h2>People</h2>
<ul class="people">
<li>Gunes Erkan</li>
<li>Ahmed Hassan</li>
<li>Dragomir Radev</li>
</ul>
</div>
</div>
</div>
<div id="gin" class="colmask rightmenu">
<div class="colleft">
<div class="col1">
<h1>Gene Interaction Network</h1>
<img alt="" src="images/gin-330.png" class="logo"
/>
<p>
GIN (Gene Interaction Network) is a system for browsing
articles and molecule interaction information. What
makes GIN stand out from other similar systems is that
it uses automated methods (such as dependency parsing)
to mine the text for relevant information (such as
protein interactions) and computes statistics for the
interaction network. The user can browse articles with
highlighted summary sentences, citing sentences
(sentences from other articles that cite the article in
question), and interaction sentences. The user can also
browse molecules to view their interactions,
neighborhood, and other network statistics.
</p>
<!--<h2>Demonstration</h2>
<ul class="links">
<li><a href="http://belobog.si.umich.edu:8080/gin/">Demo
Site</a></li>
</ul>-->
</div>
<div class="col2">
<h2>People</h2>
<ul class="people">
<li>Arzucan Ozgur</li>
<li>Thuy Vu</li>
<li>Gunes Erkan</li>
<li>Anthony Fader</li>
<li>Joshua Gerrish</li>
<li>Mark Schaller</li>
<li>Dragomir Radev</li>
<li>Amjad Abu-Jbara</li>
</ul>
</div>
</div>
</div>
<div id="gin-na" class="colmask rightmenu">
<div class="colleft">
<div class="col1">
<h1>GIN-NA: Gene Interaction Network Analysis</h1>
<p>
GIN-NA is a system for analysing molecule interaction
networks. The interaction networks are retrieved from
the MiMI database, which integrates protein interactions
from diverse biological data sources. Two
types of networks are analyzed:
molecule-specific networks and disease-specific
networks. Molecule-specific networks are the networks of
interactions in the neighborhood of a molecule. Besides
the general network statistics such as average degree,
power-law degree distribution, clustering coefficient,
and shortest path statistics, GIN-NA ranks the molecules
in the network based on graph centrality measures and
second neighbor statistics. Disease-specific networks
are built by compiling lists of known disease genes and
retrieving the interactions among these genes and their
neighborhood. We hypothesize that the genes central in
the disease-specific gene interaction network are likely
to be related to the disease and rank the genes based on
their centrality scores. Currently, GIN-NA provides
disease-specific networks for the four Driving
Biological Problems: Prostate Cancer, Type 1 Diabetes,
Type 2 Diabetes, and Bipolar Disorder.
</p>
<h2>Papers</h2>
<ul class="links">
<li><a
href="http://clair.si.umich.edu/~radev/papers/nar09.pdf">Tarcea,
V. G.; Weymouth, T.; Ade, A.; Bookvich, A.; Gao, J.;
Mahavisno, V.; Wright, Z.; Chapman, A.; Jayapandian, M.;
Özgür, A.; Tian, Y.; Cavalcoli, J.; Mirel, B.; Patel,
J.; Radev, D.; Athey, B.; States, D.; Jagadish, H. V.
“Michigan Molecular Interactions (MiMI) r2: From
Interacting Proteins to Pathways”. <cite>Nucleic Acids
Research</cite> 37: January,
2009. pp. D642–D646.</a></li>
<li><a
href="http://clair.si.umich.edu/~radev/papers/JBB-vaccine-2010.pdf">Özgür,
Arzucan; Xiang, Zhuohuang; Radev, Dragomir R.; He,
Yongqun. “Literature-Based Discovery of IFN-gamma and
Vaccine-Mediated Gene Interaction
Networks”. <cite>Journal of Biomedicine and
Biotechnology</cite>. 2010.</a></li>
<li><a
href="http://clair.si.umich.edu/~radev/papers/bioinformatics08.pdf">Özgür,
Arzucan; Vu, Thuy; Erkan, Güneş; Radev, Dragomir
R. “Identifying Gene-Disease Associations Based on
Centrality on a Literature Mined Gene Interaction
Network”. <cite>Bioinformatics</cite> 24:
2008. pp. i277–i285.</a></li>
<li><a
href="http://clair.si.umich.edu/~radev/papers/JBS11.pdf">Arzucan Özgür, Zhuohuang Xiang, Dragomir R. Radev, and Yongqun He.
Mining of vaccine-associated IFN-gamma gene interaction networks
using the Vaccine Ontology.<cite> Journal of Biomedical Semantics,</cite> 2(Suppl
2):S8, 2011.</a></li>
</ul>
</div>
<div class="col2">
<h2>People</h2>
<ul class="people">
<li>Arzucan Ozgur</li>
<li>Dragomir Radev</li>
<li>Amjad Abu-Jbara</li>
</ul>
</div>
</div>
</div>
<div id="gin-ie" class="colmask rightmenu">
<div class="colleft">
<div class="col1">
<h1>GIN-IE: Gene Interaction Extraction from the
Literature</h1>
<img alt="" src="images/prot_int-330.png"
class="logo" />
<p>Knowing that a pair of molecules is related is not enough:
context such as the type and directionality of the
relationship is also important. To extract relationships together with
their types and directionalities, we use sentences and their
dependency parse trees, which enable syntax-aware
inferences about the roles of the entities in a sentence. We
investigate both machine learning based and rule-based
approaches. We extract paths between a protein pair in the dependency
parse tree of a sentence and define two kernel functions for SVM,
based on cosine similarity and edit distance based similarity among these
paths. We participated in the BioCreative Meta-Server Project,
a platform for integrating text mining and information extraction
services for molecular biology. We contributed an
annotation server that uses the path edit kernel with SVM to classify
biomedical articles according to whether they describe
protein-protein interactions. While machine learning based approaches achieve more balanced
precision and recall, rule-based methods achieve higher
precision at the expense of recall. High precision is an important
requirement for most real-life applications. The high-precision
interaction extraction pipeline is integrated with the daily
processing of PubMed updates at NCIBI. The extracted
interactions are published as an RSS feed and are also available
through the Michigan Molecular Interactions (MiMI) system.
</p>
<h2>Demonstrations</h2>
<ul class="links">
<li><a
href="http://mimi.ncibi.org/">Demo
Site</a></li>
</ul>
<h2>Papers</h2>
<ul class="links">
<li><a
href="http://clair.si.umich.edu/~radev/papers/nar09.pdf">Tarcea,
V. G.; Weymouth, T.; Ade, A.; Bookvich, A.; Gao, J.;
Mahavisno, V.; Wright, Z.; Chapman, A.; Jayapandian, M.;
Özgür, A.; Tian, Y.; Cavalcoli, J.; Mirel, B.; Patel,
J.; Radev, D.; Athey, B.; States, D.; Jagadish,
H. V. “Michigan Molecular Interactions (MiMI) r2: from
Interacting Proteins to Pathways”. <cite>Nucleic Acids
Research</cite> 37: January, 2009. pp. D642–D646.</a></li>
<li><a
href="http://clair.si.umich.edu/~radev/papers/JBB-vaccine-2010.pdf">Özgür,
Arzucan; Xiang, Zhuohuang; Radev, Dragomir R.;
He, Yongqun. “Literature-Based Discovery of
IFN-gamma and Vaccine-Mediated Gene Interaction
Networks”. <cite>Journal of Biomedicine and
Biotechnology</cite>. 2010.</a></li>
<li><a
href="http://clair.si.umich.edu/~radev/papers/bioinformatics08.pdf">Özgür,
Arzucan; Vu, Thuy; Erkan, Güneş; Radev, Dragomir
R. “Identifying Gene-Disease Associations Based on
Centrality on a Literature Mined Gene Interaction
Network”. <cite>Bioinformatics</cite>. 24. pp. i277–i285. 2008.</a></li>
<li><a
href="http://clair.si.umich.edu/~radev/papers/genome08.pdf">Leitner,
Florian; Krallinger, Martin; Rodriguez-Penagos, Carlos;
Hakenberg, Joerg; Plake, Conrad; Kuo, Cheng-Ju; Hsu,
Chun-Nan; Tsai, Richard Tzong-Han; Hung, Hsi-Chuan; Lau,
William W.; Johnson, Calvin A.; Saetre, Rune; Yoshida,
Kazuhiro; Chen, Yan Hua; Kim, Sun; Shin, Soo-Yong;
Zhang, Byoung-Tak; Baumgartner, William A.; Hunter,
Lawrence; Haddow, Barry; Matthews, Michael; Wang,
Xinglong; Ruch, Patrick; Ehrler, Frederic; Özgür,
Arzucan; Erkan, Güneş; Radev, Dragomir R.; Krauthammer,
Michael; Luong, ThaiBinh; Hoffman, Robert; Sander,
Chris; Valencia, Alfonso. “Introducing Meta-Services for
Biomedical Information Extraction”. <cite>Genome
Biology</cite>. 9. September, 2008.</a></li>
<li><a
href="http://clair.si.umich.edu/~radev/papers/emnlp07bio.pdf">Erkan,
Güneş; Özgür, Arzucan; Radev, Dragomir
R. “Semi-Supervised Classification for Extracting
Protein Interaction Sentences Using Dependency
Parsing”. <cite>Proceedings of the Conference of
Empirical Methods in Natural Language Processing (EMNLP
'07)</cite>. Prague, Czech Republic. June 28–30,
2007.</a></li>
<li><a
href="http://clair.si.umich.edu/~radev/papers/biocreative07.pdf">Erkan,
Güneş; Özgür, Arzucan; Radev, Dragomir R. “Extracting
Interacting Protein Pairs and Evidence Sentences by
using Dependency Parsing and Machine Learning
Techniques”. <cite>Proceedings of the Second BioCreAtIvE
Challenge Workshop - Critical Assessment of Information
Extraction in Molecular Biology</cite>. April 23–25,
2007.</a></li>
<li><a
href="http://clair.si.umich.edu/~radev/papers/JBS11.pdf">Özgür,
Arzucan; Xiang, Zhuohuang; Radev, Dragomir R.; He,
Yongqun. “Mining of Vaccine-Associated IFN-gamma Gene
Interaction Networks Using the Vaccine
Ontology”. <cite>Journal of Biomedical
Semantics</cite>. 2(Suppl 2):S8. 2011.</a></li>
</ul>
</div>
<div class="col2">
<h2>People</h2>
<ul class="people">
<li>Arzucan Ozgur</li>
<li>Gunes Erkan</li>
<li>Dragomir Radev</li>
<li>Amjad Abu-Jbara</li>
</ul>
</div>
</div>
</div>
<div id="bioevents" class="colmask rightmenu">
<div class="colleft">
<div class="col1">
<h1>Extracting Biomedical Events from the Literature</h1>
<p>
Most previous work on biomedical information extraction
focuses on identifying relationships among biomedical
entities (e.g., protein-protein interactions). Unlike
relationships, which are generally characterized by a
pair of entities, events are characterized by event
types and multiple entities in varying roles. The
BioNLP'09 Shared Task addresses the extraction of
bio-molecular events from the biomedical literature. We
participated in the “Event Detection and
Characterization” task (Task 1). The goal was to
recognize the events concerning the given proteins by
detecting the event triggers, determining the event
types, and identifying the event participants. We group
the event types into three general classes based on the
number and types of participants that they involve. The
first class includes the event types that are described
with a single theme participant. The second class
includes the event types that are described with one or
more theme participants. The third class includes the
events that are described with a theme and/or a cause
participant. We train a support vector machine (SVM)
model for each class of events to classify each
candidate event trigger/participant pair as either a
real trigger/participant pair or not. We use various types of
linguistic features such as lexical, positional, and
dependency relation features that represent the contexts
of the candidate trigger/participant pairs.
</p>
<h2>Papers</h2>
<ul class="links">
<li><a
href="http://clair.si.umich.edu/~radev/papers/BioNLP-2009.pdf">Özgür,
Arzucan; Radev, Dragomir R. “Supervised
Classification for Extracting Biomedical
Events”. <cite>Proceedings of the BioNLP'09
Workshop Shared Task on Event Extraction at
NAACL-HLT</cite>. Boulder, Colorado. June,
2009.</a></li>
</ul>
</div>
<div class="col2">
<h2>People</h2>
<ul class="people">
<li>Arzucan Ozgur</li>
<li>Dragomir Radev</li>
</ul>
</div>
</div>
</div>
<div id="biocontext" class="colmask rightmenu">
<div class="colleft">
<div class="col1">
<h1>Extracting Non-local Context for Biomedical
Information Extraction</h1>
<p>
Most previous studies focus on extracting relationships
between pairs of molecules. However, context
information such as the type, directionality,
location, and condition of the relationship is also
important. While some types of context information such
as the relationship type and directionality can be
extracted locally from the sentence, other types of
context information such as the experimental method and
the species are not always found in the sentence, but
need to be extracted non-locally from the entire
document. We created guidelines for corpus annotation
for non-local (document-level) context extraction. We
are annotating full-text articles for species
mentions. The articles are retrieved from PubMed Central
Open Access. We approach the problem as identifying the
linguistic scope of each species mention in the
article. We defined scope classes such as entity,
sentence, paragraph, section, and article. For example,
the scope of a species mention is entity level if it
applies to a certain entity (gene/protein) in the
sentence. On the other hand, if it applies to all the
entities in the paragraph, its scope is defined to be
paragraph level. The annotated corpus will enable us to
learn models for identifying the species of the molecule
mentions in the text.
</p>
</div>
<div class="col2">
<h2>People</h2>
<ul class="people">
<li>Arzucan Ozgur</li>
<li>Dragomir Radev</li>
</ul>
</div>
</div>
</div>
<div id="speculation" class="colmask rightmenu">
<div class="colleft">
<div class="col1">
<h1>Detecting Speculations and Resolving their Scopes in
Scientific Text</h1>
<p>
Speculation is a frequently used language phenomenon in
biomedical scientific articles. When researchers are not
completely certain about the inferred conclusions, they
use speculative language to convey this
uncertainty. While speculative information might still
be useful for biomedical scientists, it is important
that it is distinguished from the factual
information. We introduce an approach that identifies
speculative sentence fragments by solving two
sub-problems. The first sub-problem is identifying
the speculation keywords in the sentences, and the second
is resolving their linguistic scopes. We formulate
the first sub-problem as a supervised classification
task, where we classify the potential keywords as real
speculation keywords or not by using a diverse set of
linguistic features that represent the contexts of the
keywords. After detecting the actual speculation
keywords, we use the syntactic structures of the
sentences to determine their scopes.
</p>
<h2>Papers</h2>
<ul class="links">
<li><a
href="http://clair.si.umich.edu/~radev/papers/EMNLP145.pdf">Özgür,
Arzucan; Radev, Dragomir R. “Detecting Speculations and
Their Scopes in Scientific
Text”. <cite>EMNLP</cite>. Singapore. 2009.</a></li>
</ul>
</div>
<div class="col2">
<h2>People</h2>
<ul class="people">
<li>Arzucan Ozgur</li>
<li>Dragomir Radev</li>
</ul>
</div>
</div>
</div>
<div id="tumbl" class="colmask rightmenu">
<div class="colleft">
<div class="col1">
<h1>Graph-Based Semi-supervised Learning</h1>
<img alt="" src="images/tumbl-330.png"
class="logo" />
<p>
Tripartite updating is related to the principal
eigenvector of a stochastic Markov process. This
algorithm is a variant of the HITS algorithm (it uses a
bipartite underlying structure and its stationary
solution is computed iteratively), though it differs
from it in three important ways: (a) the "right-hand"
component of the graph is split into two groups, labeled
and unlabeled data instances (hence the name
"tripartite"), (b) there is an initial assignment of
values for the labeled examples, and (c) the scores of
the labeled examples are not allowed to change with
time.
</p>
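<p>
The three differences listed above can be illustrated with a
minimal sketch. It is not the demo's implementation: the averaging
update rule, the function name <code>tripartite_update</code>, and
the toy data are assumptions made for illustration only. Features
form the "left-hand" side of the graph, instances the "right-hand"
side, and labeled instances keep their initial scores throughout.
</p>

```python
def tripartite_update(edges, labeled, iterations=50):
    """Propagate scores over a feature/instance bipartite graph.

    edges:   dict mapping each instance to the list of features it contains
    labeled: dict mapping labeled instances to their fixed scores
    The averaging rule below is an illustrative assumption.
    """
    features = sorted({f for fs in edges.values() for f in fs})
    # (b) labeled instances start from their given values, the rest from 0
    inst_score = {i: labeled.get(i, 0.0) for i in edges}
    feat_score = {f: 0.0 for f in features}
    feat_to_inst = {f: [i for i, fs in edges.items() if f in fs]
                    for f in features}

    for _ in range(iterations):
        # Each feature takes the mean score of the instances containing it
        for f in features:
            nbrs = feat_to_inst[f]
            feat_score[f] = sum(inst_score[i] for i in nbrs) / len(nbrs)
        # (a) only the unlabeled group is updated from the features
        for i, fs in edges.items():
            if i in labeled or not fs:
                continue  # (c) labeled scores never change
            inst_score[i] = sum(feat_score[f] for f in fs) / len(fs)
    return inst_score, feat_score


# Hypothetical toy data: two labeled and two unlabeled documents
edges = {
    "d1": ["good", "great"],
    "d2": ["bad", "awful"],
    "d3": ["good", "fine"],
    "d4": ["bad", "poor"],
}
labeled = {"d1": 1.0, "d2": -1.0}
inst_score, feat_score = tripartite_update(edges, labeled)
```

<p>
In this toy run the unlabeled document <code>d3</code> shares a
feature with the positively labeled <code>d1</code> and ends with a
positive score, while <code>d4</code> ends with a negative one.
</p>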
<h2>Demonstrations</h2>
<ul class="links">
<li><a href="../demos/tumbl">Demo
Site</a></li>
</ul>
</div>
<div class="col2">
<h2>People</h2>
<ul class="people">
<li>Dragomir Radev</li>
<li>Gunes Erkan</li>
</ul>
</div>
</div>
</div>
<div id="lexrank" class="colmask rightmenu">
<div class="colleft">
<div class="col1">
<h1>Lexical Networks and Lexical Centrality</h1>
<img alt="" src="images/Lexnet-330.png"
class="logo" />
<p>
We introduce a stochastic graph-based method for
computing relative importance of textual units for
Natural Language Processing. We consider a new approach,
LexRank, for computing sentence importance based on the
concept of eigenvector centrality in a graph
representation of sentences. In this model, a
connectivity matrix based on intra-sentence cosine
similarity is used as the adjacency matrix of the graph
representation of sentences. The results show that
degree-based methods (including LexRank) outperform both
centroid-based methods and other systems participating
in DUC in most of the cases.
</p>
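<p>
A minimal sketch of the idea follows: build a graph from pairwise
cosine similarity between sentences, then score sentences by
eigenvector centrality using power iteration. Several choices here
are assumptions for illustration and not necessarily those of the
published system: raw term counts instead of weighted term
vectors, the <code>threshold</code> and <code>damping</code>
values, and the function names.
</p>

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def lexrank(sentences, threshold=0.1, damping=0.85, iterations=100):
    """Score sentences by centrality in a thresholded similarity graph."""
    vecs = [Counter(s.lower().split()) for s in sentences]
    n = len(vecs)
    # Adjacency matrix: an edge wherever similarity reaches the threshold
    adj = [[1.0 if cosine(vecs[i], vecs[j]) >= threshold else 0.0
            for j in range(n)] for i in range(n)]
    # Row-normalize so that each row is a probability distribution
    rows = []
    for row in adj:
        total = sum(row)
        rows.append([v / total for v in row] if total else [1.0 / n] * n)
    # Power iteration with damping toward the uniform distribution
    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [(1 - damping) / n
                  + damping * sum(scores[i] * rows[i][j] for i in range(n))
                  for j in range(n)]
    return scores


# Hypothetical toy input: the first sentence overlaps with both others
sentences = [
    "graph methods rank sentences",
    "graph methods use eigenvectors",
    "summaries rank sentences well",
]
scores = lexrank(sentences)
```

<p>
Here the first sentence is connected to both of the others, which
are not connected to each other, so it receives the highest score.
</p>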
<h2>Links</h2>
<ul class="links">
<li> <a
href="projects/lexnets/lexnets.html">Lexical
networks</a></li>
</ul>
<h2>Demonstrations</h2>
<ul class="links">
<li><a
href="http://clair.si.umich.edu/demos/lexrank">Lexical
networks and lexical centrality</a></li>
<li><a
href="#">LexRankMead</a> (Temporarily unavailable)</li>
</ul>
</div>
<div class="col2">
<h2>People</h2>
<ul class="people">
<li>Gunes Erkan</li>
<li>Jahna Otterbacher</li>
<li>Dragomir Radev</li>
</ul>
</div>
</div>
</div>
<div id="mead" class="colmask rightmenu">
<div class="colleft">
<div class="col1">
<h1>Text Summarization</h1>
<img alt="" src="images/mead-330.png" class="logo"
/>
<p>
MEAD is the most elaborate publicly available platform
for multi-lingual summarization and evaluation. The
platform implements a battery of summarization
algorithms, including baselines (lead-based and random)
as well as position-based, centroid-based, longest
common subsequence, keyword-based, and query-based
methods. The methods for evaluating the quality of the
summaries are both intrinsic and extrinsic.
</p>
<h2>Links</h2>
<ul class="links">
<li><a href="http://www.summarization.com/mead">MEAD Resources</a></li>
<li><a href="http://clair.si.umich.edu/clair/CSTBank/">CSTBank</a></li>
<li><a href="http://www.summarization.com/summbank/">SUMMBank</a></li>
</ul>
<h2>Demonstrations</h2>
<ul class="links">
<li><a
href="http://www.summarization.com/mead">MEAD</a></li>
<li><a href="#">NewsInEssence
(Currently non-functional)</a></li>
</ul>
<h2>Papers</h2>
<ul class="links">
<li><a
href="http://clair.si.umich.edu/~radev/papers/elsevier08.pdf">Otterbacher,
Jahna; Radev, Dragomir; Kareem, Omer. “Hierarchical
Summarization for Delivering Information to Mobile
Devices”. <cite>Information Processing and
Management</cite> 44.2:
2008. Elsevier. pp. 931–947.</a></li>
<li><a
href="http://clair.si.umich.edu/~radev/papers/sigir06.pdf">Otterbacher,
Jahna; Radev, Dragomir; Kareem, Omer. “News
to Go: Hierarchical Text Summarization for Mobile
Devices.” <cite>29th Annual ACM SIGIR Conference
on Research and Development in Information
Retrieval</cite>. Seattle, Washington. August,
2006.</a></li>
<li><a
href="http://clair.si.umich.edu/~radev/papers/jdoc08.pdf">Otterbacher,
Jahna; Radev, Dragomir. “Exploring Fact-focused
Relevance and Novelty Detection”. <cite>Journal of
Documentation</cite> 64.4: 2008. Emerald.</a></li>
<li><a
href="http://clair.si.umich.edu/~radev/papers/lrec08j.pdf">Otterbacher,
Jahna; Radev, Dragomir R. “Modeling Document Dynamics:
An Evolutionary Approach”. <cite>LREC</cite>. Marrakech,
Morocco. May, 2008.</a></li>
<li><a
href="http://clair.si.umich.edu/~radev/papers/csetr537-07.pdf">
Otterbacher, Jahna; Shen, Siwei; Radev, Dragomir R.; Ye,
Yang. “Tracking Factual Information in Evolving
Text: An Empirical Study”. University of
Michigan. Department of Electrical Engineering and
Computer Science. CSE-TR-537-07. 2007.</a></li>
<li><a
href="http://clair.si.umich.edu/~radev/papers/sigir06poster.pdf">Otterbacher,
Jahna; Radev, Dragomir. “Fact-focused Novelty Detection:
a Feasibility Study”. <cite>Poster session, 29th Annual
ACM SIGIR Conference on Research and Development in
Information Retrieval</cite>. Seattle,
Washington. August, 2006.</a></li>