Oh yeah, Scandinavia.

Posted on 24 October 2006 at 21:45 by vika. Categories: big wide world, news, politics.

Meant to write a mini-news-update on Denmark and Norway, who’ve had a particularly productive news day today. Behold:

Denmark leads social justice rankings, says a German think tank. *wistful sigh* Color me transplant-wannabe, and that’s just the first link.

Over 7000 Swedes commute to Denmark daily for work, and a new EU directive may relieve the tax burden on Danish employers, who at the moment are technically supposed to pay a sizeable chunk of cash in taxes to the Swedish government in addition to what they already pay to the Danish one. I’m not sure how it is that Sweden wins, here; it’s likely to be a touch-and-go process. But if they do succeed in working something out to everyone’s benefit, great.

Denmark, the brand name. They’re putting forth a serious effort to promote their country, presumably to drum up tourism and improve the country’s image (as if it needs to be improved, much). Go go Denmark gadget; given funds availability, I’d go there again in a heartbeat. Then again, see transplant above.

Compare and contrast to Norway, whose chief profits are still coming from oil. From conversations with Jill a few years ago, Norway at least seems to be going about oil production more responsibly than most other countries that have access to this resource.

“High consumption lands Norway among world’s worst: Norway, which generally prides itself on maintaining high environmental standards, seems to actually be using way more than it should of the world’s natural resources.” Oh yeah, Norway? Well, the good ol’ USA is second in the worst-offenders list, compared to your paltry 11th! We sure showed you!

Oy.

Finally, the young Norwegian who cracked DVD protection a few years ago claims to have done the same with the dread iTunes/iPod combo. “Johansen claims he’s mastered the inner workings of the iPod and its FairPlay encryption technology, allowing him to remove many of the restrictions Apple places on its users. Today, songs purchased from Apple’s iTunes store can’t be played on non-iPod devices, and, if you’ve bought songs from other music stores, the chances are you won’t be able to play them on the iPod either since they use a form of copy protection that Apple doesn’t support. […] Johansen’s driving force is his belief that users have the right to listen to songs they have bought legally on any device they own. […] Unlocking the iPod-iTunes ecosystem is seen by many as a good thing for consumers, as it will most likely result in increased competition to the iTunes Store, possibly resulting in lower prices and a higher quality service.” No particular comment here, except that I’m pleased: the iTunes/iPod black box has gotten on my nerves more than once.

Otuel and Roland, and Scandinavia.

Posted on at 10:53 by vika. Categories: phd - mechanics, rolandht.

Work is getting easier – sitting down and actually working, that is, as opposed to dreading it and feeling guilty about not doing it. I’ve been on the same primary source since last Thursday, but it is big (over 2700 lines), so I have sixteen whole excerpts from it. Only the Song of Roland has more excerpts. Plus, this one (Otuel and Roland) is in Middle English. Instead of translating it – at which I’d do a miserable job – I’ve written a mini-guide on pronunciation that should take the reader pretty far, and am encoding translations for the particularly obscure words using the glossary at the end of the book. This is adding a lot of encoding time, but should be cool if I can figure out how to make the translations appear on mouseover. (If I can’t figure it out, there’s always Ethan to beg for help, but if it can be done with XSLT/CSS, I shouldn’t need to.)
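One way the mouseover could work, sketched in Python rather than XSLT: wrap each glossed word in a span whose title attribute carries the translation, since browsers show a title as a tooltip on hover. The glossary entry, the "gloss" class name and the sample line are all made up for illustration; the real glosses would come from the edition's glossary.

```python
import html
import re

# Hypothetical glossary entry, for illustration only.
GLOSSARY = {"swithe": "quickly"}


def add_glosses(line, glossary):
    """Return HTML for one line of verse, with glossed words wrapped in
    <span title="..."> so the translation shows up on mouseover."""
    # Splitting on a capturing group keeps both words and the text between them.
    pieces = re.split(r"([A-Za-z]+)", line)
    out = []
    for piece in pieces:
        gloss = glossary.get(piece.lower())
        if gloss is None:
            # Not a glossed word: just make it safe for HTML.
            out.append(html.escape(piece))
        else:
            out.append('<span class="gloss" title="{}">{}</span>'.format(
                html.escape(gloss, quote=True), html.escape(piece)))
    return "".join(out)


if __name__ == "__main__":
    print(add_glosses("He rode swithe & far", GLOSSARY))
```

A CSS rule on the "gloss" class (a dotted underline, say) could then signal to the reader which words have a translation waiting.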

Right. To work.

[Psst… Livejournal readers – just a reminder that if you comment on the feed, I don’t get notified, and at the rate things are going, am unlikely to go back to past posts and check to see whether there are any comments. Instead of clicking on “leave comment,” click the URL for the post, and you’ll be magically transported to a comment interface on Words’ End.]

Federal gummint and marijuana studies!

Posted on 23 October 2006 at 20:28 by vika. Categories: politics.

Top 10 Pot Studies Government Wished It Had Never Funded.

That’s excellent. Now I can point people to one handy link when they start spouting bullshit about marijuana use, users, function as a Gateway Drug ™, etc. without actually knowing anything about its socio-political history.

If wishes were babies…

Posted on 22 October 2006 at 12:09 by vika. Categories: big wide world, family, news.

Via the inimitable Ms. Bitch comes a link to a Washington Post article (free registration required, I believe): “As Europe Grows Grayer, France Devises a Baby Boom.”

France has woken up to a bit of a population crisis going around Europe: all of Europe is below replacement rate, meaning the population count is going down. In addition, they like families. You know, have family values. I remember hearing something about that in the U.S., vaguely and only once or twice.

Some excerpts from the article:

When the municipal day-care center ran out of space because of a local baby boom, the town government gave Maylis Staub and her husband $200 a month to defray the cost of a “maternal assistant” to care for their two children.

When Staub delivered twins last December — her third and fourth children — the nation not only increased their tax deductions and child allowances, the government-owned French train system offered 40 percent discounts off tickets for the parents and the children until they reach their 18th birthdays.[…]

France heavily subsidizes children and families from pregnancy to young adulthood with liberal maternity leaves and part-time work laws for women. The government also covers some child-care costs of toddlers up to 3 years old and offers free child-care centers from age 3 to kindergarten, in addition to tax breaks and discounts on transportation, cultural events and shopping. […]

A century ago, France was one of the first European countries to face a declining population. Since then, almost every elected French government — regardless of party — has instituted laws that encourage bigger families and make it easier for women to keep their jobs while raising children.

Now that’s family values.

Technological wonders and peripheral lucidity.

Posted on 19 October 2006 at 21:11 by vika. Categories: phd - mechanics, rolandht, strangeworld, taking it personally.

Ethan’s taken geeky anti-vandal measures. Plus, we now have a set of functioning motion-sensor floodlights. Come back, kid. I want you to show your face.

This repeated-senseless-violence thing has been… distracting; I had been unsuccessfully trying to work for two days and instead somehow getting sucked into the WaiterRant archives again and again. But lo, as soon as I sit down to read/annotate some primary sources (instead of writing the second chapter, which is due – oh – at the end of the month), work gets interesting again. Go figure.

Reading and annotating, in this case, is a lot of pattern-searching. All afternoon and evening my peripheral vision has been crazy-sensitive. I wonder if the two are related.

Don’t have any more words.

Posted on 18 October 2006 at 19:58 by vika. Categories: outrage, taking it personally.

He fucking did it again.

One un-smashed window remains in our car.

He’s some punk kid. I saw him but, of course, not his face.

Fuck you, and your little hammer too.

Posted on 17 October 2006 at 23:38 by vika. Categories: strangeworld, taking it personally.

Some bored soul has busted in the rear windshield and one of the back door windows of our car tonight. Probably with one of those little hammers that come in safety kits: the windshield is broken in the middle and also around the edges, but not in-between – and there are no marks on the metal around the edges, so it probably wasn’t a sledgehammer.

Fucking punks. Completely pointless vandalism. Unless they know us and are trying to make some sort of a statement, which I highly doubt: nobody really knows us around here, except by passing hellos. I hate this sort of shit, this mindless mean-spiritedness. [ETA: They didn’t take the ipod that was in the glove compartment, or my passport, or the road atlas. Really pointless.]

Now, instead of angsting about my dissertation tomorrow, I get to call the insurance company and sweep up broken glass, and maybe drive the car over to some repair place or other (or maybe wait until they get the part in, since these cars haven’t been sold in the States for that long). Plus there’ll be a deductible-sized hole in our budget that we just didn’t need. All of it because some idiot broke stuff for ha-has.

repeating work patterns.

Posted on at 11:39 by vika. Categories: phd - mechanics.

Work on the thesis is being done in spurts. Partly of necessity: I’ve managed to schedule myself for five out-of-town events, in four trips, all within the same two months. Plus there’s a possibility of getting to spend time with the Nephew, who is turning into an excellent if willful little person, and while I like having him over, I’ve discovered that zero work gets done when he’s around. I suspect it’d be different if we lived closer together, but so it goes.

Anyway, my mental pattern with regard to thesis work has been repeating – rather predictably so; this pattern has existed since long before Roland. It goes something like this:

Before the work period starts: attitude cavalier, anxiety far at bay. Usually, during this time things are happening that make me feel good – conferences, family time, just-breaks with good books.

First day of work period: attitude of “ok, here I am buckling down.” Permitting myself to spend this one day Organizing, which never takes just a day, so the day almost inevitably ends in a vague state of many things accomplished but not enough, damnit.

Second day: overwhelmed and in denial. Repeat mantras of the “I’m smart enough and diligent enough to do this in time – but it won’t happen if I keep succumbing to the anxiety and denial, for they evilly drain energy, confidence and time” sort. Keep having to remind myself that I love this stuff (and I do, it’s the time pressure that’s a bitch to deal with).

From this second day on, if I manage to get myself to start working sometime before 10, life is good. If I don’t, I lose days to self-loathing, or thought patterns less dramatic but just as draining.

The hardest thing is not knowing how long this thesis will take, or how much work it will be. I’ve set myself a hard limit – graduate next spring – but what if I don’t get the work done? If I could see the steps clearly, it would be easier to work. As it stands, it’s hard to even know which large swathes of work will turn out to be useless for the current purposes. There’s no way I can do justice to the Roland corpus in the course of this dissertation; defining its limits in a field of material that I don’t know that well (there’s SO MUCH of it!) seems like a futile exercise.

Outlines don’t help, either. They take so much time to make, and then I have to change them a million times over. My working outline helps me organize whatever it is I want to do next, I guess. But since it isn’t representative of the final thesis structure (as I discovered rewriting chapter the first), it’s not an indicator of how far along I am.

These are some of the things that make thesis writing hard – I’ve read and heard this from many sources. The resultant anxiety is a pest, and I resent it for that. Go. Shoo!

CaSTA: the closing.

Posted on 14 October 2006 at 11:54 by vika. Categories: digital humanities.

Whew, that was grand. Just one thing about the closing panel discussion, while it’s fresh in my mind.

This year’s CaSTA was billed as “a joint computer science and humanities computing conference.” And it was! And [we saw that] it was good. Of the five keynote speakers, three were humanists and two – computer scientists. The final discussion was called “Humanities Computing Science??”.

William Arms, in his remarks during the panel, said that during the conference a word was frequently used that isn’t generally used in his usual [computer-science] circles. That word – knowledge. He, and just about everyone at the panel, said that what they primarily want from the “other side” is dialogue.

In light of that, what I’d like to see in this continuing dialogue is a bit of discussion of the word science. As it’s been used lately (in the last, what, 200 years?), it implies “HARD.” Humanities implies “soft.” That’s a major point of contention.

But given that “science” pretty much means “knowledge,” should we revisit our use of the word?

Siemens on REKn

Posted on at 10:30 by vika. Categories: digital humanities.

Ray Siemens is a computing humanist and Renaissance scholar working at the University of Victoria. The full title of his paper is “Knowledge management and textual cultures? Work toward the Renaissance English Knowledgebase (REKn, pron. “reckon”) and its professional reading environment (PReE).”

REKn seems to be aiming to amalgamate and integrate knowledge in its area. Its implementation is based in the study of disciplinary activity of, and professional interaction among, those in the humanities. It’s founded in concepts of knowledge representation and modeling. A short description of the project can be found here.

Knowledge representation: draws on the field of AI and seeks to produce models of human understanding that are tractable to computation. Modeling: REKn/PReE model data, intellectual processes, and beyond.

Key elements of REKn’s model:
- representation of archival materials
- analysis/critical inquiry originating in those materials
- the communication of the results of these tasks (the dissemination of primary and secondary materials)

REKn’s assumptions: all of the above are interrelated and inseparable, and electronically representable.

They’ve collected primary and secondary sources, and have built tools for working with them (the tool-building process seems to have been multi-stage: many tools built and discarded as inadequate). They’re looking to long-term partnerships with Renaissance materials providers in the future. Right now REKn has about 13,000 primary sources and over 80,000 secondary sources. About 1500 of these resources are currently available for public use, but the majority are not open-access.

So that’s REKn, the text base. What about PReE, the reading environment? It’s a rudimentary document viewer, and analysis and communication facilitator. Currently the UI is a “down-and-dirty prototype,” as they’ve been concentrating on making things work in the back end. [vz: he’s showing PReE in Windows; I wonder what it’s written in.] They’ve made several analytical tools for the encoded texts. Primarily, though, analysis will be carried out using TAPoR tools.

Communication facilitated electronically: they’re attempting to provide a system by which people can manage their professional interaction.

Short-term goals:
- integrate better with TAPoR and the Public Knowledge Project reading tools
- conduct usability studies
- consult with “contextual” stakeholders, including acad. publishers
- move prototype to a web environment
- scale up!

And that’s Ray’s talk, the last talk of this conference. Next is the panel discussion, titled “Humanities computing science?”. The panel will consist of the five keynote speakers. I’m not sure whether I’ll be taking notes on this; it’ll be video recorded, and there’s little probability that I’d do it justice. Again, I’ll update when the webcasts are up.

Lyman on the electronic Piers Plowman

Posted on at 10:07 by vika. Categories: digital humanities.

Eugene Lyman is an Early English scholar at Boston University. The full title of his paper is “Presenting an electronic critical edition of Piers Plowman B.” The B refers to one set in a multitude of manuscripts of this Middle English alliterative poem, comprising only one of its versions.

EL frames his talk with the following two quotes:

“Computers compute, of course, but computers today, from most users’ points of view, are not so much engines of computation as venues for representation.” –Matthew Kirschenbaum

“Understanding the poetics and principles of electronic scholarly editing means understanding that the primary goal of this activity is not to dictate what can be seen but rather to open up ways of seeing.” –Martha Nell Smith

Lyman has created software that allows you to look at the existing manuscript pages from ten different manuscripts, enlarge portions of those pages, read the transcriptions, look at erasures that tell us interesting things about how people might’ve edited in medieval England. You can search within the dataset, visualize the text in various ways, view the underlying XML markup… yesterday (or was it the day before?) EL actually gave some of us an informal demo of this thing, and it is sweet.

How to make a continuity of presenting a single text that exists in multiple manuscripts?

EL created the Elwood Viewer, which looks at documentary editions of texts. Its aims:
- tight coordination of text and image
- visual cueing to guide/reinforce reader’s attention
- handy tools, all within a metaphorical arm’s reach
- ease of navigation, especially at opportune moments
- parsimonious use of screen real estate
- simple, no-cost programming environment, open to change [this is all JavaScript, I think… -vz]

The software allows you to extract data for further analysis. One such analysis called into question the notion that scribes were totally random with respect to the ways in which they encoded [marked up, oh yes, for markup exists in many forms including punctuation and embellished first letters] their texts.

This is all a critical edition: you take a group of witnesses, compare them, note the variations, and choose the variations you think were on the author’s agenda when she was writing the text. The notion of a critical edition, which effectively “leaves behind” the actual manuscripts, is pretty controversial; nevertheless, at the moment critical editions have a lot of weight in literary studies. [vz: I’m leaning toward the camp that’s sceptical of critical editions, but only in the sense that I don’t tend to consider them definitive, while many others do.]
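The compare-and-note-variations step can be sketched in a few lines of Python. This is not EL's software, just a toy under loud assumptions: the three witnesses and their spellings are invented, the lines are already aligned word for word, and a majority vote stands in (crudely) for the editor's judgment about what the author intended.

```python
from collections import Counter

# Invented witnesses of one line, keyed by made-up sigla.
WITNESSES = {
    "W1": "In a somer seson whan softe was the sonne".split(),
    "W2": "In a somer sesoun whan softe was the sonne".split(),
    "W3": "In a somur seson whan soft was the sonne".split(),
}


def line_length(witnesses):
    """Check that every witness has the same number of words and return it."""
    lengths = {len(tokens) for tokens in witnesses.values()}
    if len(lengths) != 1:
        raise ValueError("witnesses must be aligned word for word")
    return lengths.pop()


def collate(witnesses):
    """List the positions where the witnesses disagree, with each reading."""
    variants = []
    for i in range(line_length(witnesses)):
        readings = {siglum: tokens[i] for siglum, tokens in witnesses.items()}
        if len(set(readings.values())) > 1:
            variants.append((i, readings))
    return variants


def majority_text(witnesses):
    """Pick the most common reading at each position (a stand-in for editing)."""
    return [
        Counter(tokens[i] for tokens in witnesses.values()).most_common(1)[0][0]
        for i in range(line_length(witnesses))
    ]


if __name__ == "__main__":
    for position, readings in collate(WITNESSES):
        print(position, readings)
    print(" ".join(majority_text(WITNESSES)))
```

The list of variants is roughly what a critical apparatus records; the controversy is all in the second function, where a real editor would argue for each choice instead of counting votes.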

EL has a prototype version of the critical edition. I haven’t found any screenshots of it on the web, but here’s the Piers Plowman Electronic Archive hosted at the University of Virginia. It seems more a worksite than a ready-made resource, but is worth poking around.

Drucker on visualizing interpretation.

Posted on at 8:53 by vika. Categories: digital humanities.

Johanna Drucker is a professor of media studies, and a founding member of UVA’s Speculative Computing Lab. The full title of her keynote is “Graphic conventions: visualizing knowledge and subjectivity.”

Can we shift from information to interpretation, and shift [back] to a more humanistic point of view within digital humanities? To do that, we have to re-introduce subjectivity, and maybe substitute the mechanistic with the probabilistic (the latter being humanists’ worldview, according to JD).

Visual information conveys information in a form that makes it very hard to analyze systematically. Two ways to create stable knowledge in a notation system: one is with [natural] language, the other – mathematical notation, said Rene Toms of the Oulipo. He never did talk about visual representation, and with good reason: it’s an unstable notation mode.

Subjectivity comes in two forms: position (structural) or inflection (semantic).

The notion of information comes from a particular set of assumptions of what knowledge is. JD is not interested in getting rid of this model of knowledge, but rather in proposing another one.

Visualization is compact, the problem isn’t having enough space to represent information – it’s pinning down the exact nature of the information and our assumptions about its aspects.

Visual information (VI) can lie or misinform, like natural language can.

The silliness of chunking of processes (how authors write stories: “author thinks about a topic” –> “author sketches an outline” –> “author reviews the sketch”) is apparent, but we have to do this sort of chunking when we’re working in a computational environment, which requires discrete units. Because of schematics’ rhetorical power, we eventually come to believe them.

So what about text visualization (as opposed to data visualization above)? Interest in doing things to and with texts is very active, especially within the creative-writing communities. Like TextArc, that sort of thing. They can be silly, ugly, destructive of the original, and yet they have their uses.

Edward Tufte, the exquisite engineer according to JD: information pre-exists visualization. Visualizations can be transparent enough to get us access to information. JD disagrees: visualizations are interpretive, opaque, distortive. They create information.

Temporal modeling at SpecLab. Basic assumption: timelines as they are conventionally defined and designed come out of the empirical/natural sciences. Assumptions there: time is unilinear; time is homogeneous (metric is stable); time is continuous (no breaks in temporality). None of these three things hold. Temporality branches in our lives, in poetry/film/etc. Time is not homogeneous – some moments fly by, others are long (the moment before the kiss and the moment after are very different, JD says). Time is not continuous, either: there are breaks/ruptures, recorded in historical accounts for example.

SpecLab constructed a grammar of inflections, of visual elements they’d use to represent time, types of events and their relations to each other. There’s a lot of information JD is giving about what SpecLab has been doing; I’ll point you to the Lab’s site instead of summarizing.

In the IVANHOE game, every action takes place from within a role. Each role has a set of assumptions that go with it. They employed it as a teaching tool at UVA, with the purpose of showing that, in fact, every action stems from a set of presuppositions. [vz: that can’t be right, I’ve captured too simplistic a description. Go see the site for more.]

Subjective meteorology: JD’s current project. An art project, which JD says – duh, art is in the humanities. [vz: yay!] She charted and graphed and visually represented a bunch of weather patterns – lines of anxiety/anticipation, storms of anger – which look gorgeous on the slides but don’t seem to be on the web. These representations can be chained together and animated to playfully and visually represent one’s subjective perceptions of the world around.

Great discussion follows. I can’t pretend to capture it well enough; I’ll post an update when the keynote webcasts are up, and urge anyone interested to watch this one when it’s available.

Hoover on CaSTA and breadth.

Posted on 13 October 2006 at 16:06 by vika. Categories: digital humanities.

David Hoover is at NYU, and is the Vice-President of the Association for Computers and the Humanities. The full title of his paper is “CaSTAing breadth upon the waters.” (“Cast thy bread upon the waters: for thou shalt find it after many days,” says Ecclesiastes 11 – hence his title.)

DH seeks simple methods for examining word frequencies in corpora of single authors, at different stages of their production. Do authors tend to start disliking (using less) words they used to like (use a lot) earlier in their careers? Or vice versa? How does an author’s vocabulary change over her production’s life? DH does a lot of statistics to try to find out.

He’s talking about Trollope, about whom I know nothing. Apparently, the 100 most variable words in his corpus are all proper names. They also all appear in more than one stage of his writing career (early-middle-late). Henry James, however, has some non-proper nouns.

DH’s project is very much in progress; he says it’ll be a while until he has something interesting to say about the evolution of writerly language. One interesting question is: if you have a writer who starts writing very young, does their vocabulary change quickly, early on? What about writers who write far into old age?

One interesting consistency in James is that he seems to have used fewer “precious” nouns as the years went on: words like coquette and tresses.
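To make the like-then-dislike question concrete, here is a minimal Python sketch that compares a word's relative frequency in an early sample and a late one. It is not DH's method (he does a lot more statistics than this), and the two tiny samples are invented; real input would be whole novels grouped by career stage.

```python
from collections import Counter


def relative_frequencies(tokens):
    """Map each word to its share of all the tokens in the sample."""
    if not tokens:
        return {}
    total = len(tokens)
    return {word: count / total for word, count in Counter(tokens).items()}


def frequency_shift(early_tokens, late_tokens):
    """Return (word, late share minus early share), most-dropped words first."""
    early = relative_frequencies(early_tokens)
    late = relative_frequencies(late_tokens)
    words = set(early) | set(late)
    shifts = {w: late.get(w, 0.0) - early.get(w, 0.0) for w in words}
    return sorted(shifts.items(), key=lambda item: item[1])


if __name__ == "__main__":
    # Invented samples standing in for an early and a late text.
    early = "the coquette shook her tresses".split()
    late = "the man shook his head".split()
    for word, shift in frequency_shift(early, late):
        print(word, round(shift, 2))
```

Words with a negative shift are the ones the author has cooled on; words near zero are stable across the two stages.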

Cunningham on the Arte of Navigation

Posted on at 15:34 by vika. Categories: digital humanities.

Richard Cunningham is at the Acadia University English department, and directs the hypermedia center there. The full title of his paper is “Developing digital navigation from The Arte of Navigation.”

Readers experience something different reading an electronic document as opposed to a paper one. [Glaringly obvious, RC admits.]

RC presents the Acadia Digital Culture Observatory. They have digitized a 1561 edition of The Arte of Navigation and want to observe how readers read and use it. The original text included a navigation instrument made of three concentric paper circles of different sizes (volvelles), which are to be overlaid one on top of another and rotated. Here, you can see it for yourself. (Hm, it doesn’t seem to work in Firefox on Mac in Blackletter mode; I suggest the use of Arial to be safe, or you can download Blackletter from their table of contents.) Check out particularly the navigation instrument Flash files in the “other moving images” section of the TOC, they’re fun to play with.

TAPoR on TAPoR.

Posted on at 14:23 by vika. Categories: digital humanities.

Ray Siemens of Victoria hosts a session of three papers related to the Text Analysis Portal for Research. First we have Geoffrey Rockwell, with “Text empires: text analysis in excess.” Shawn Day will talk about “The use of the recipe as a guiding metaphor for flexible and efficient self-guided computing instruction.” Finally, Stéfan Sinclair will talk “On data & views in text analysis.” All three presenters are from McMaster University in Hamilton, near Toronto.

ROCKWELL.

Information overload: 5 exabytes of information created in 2002. Exabyte = 1,000,000,000,000,000,000 Bytes. It’s a thousand petabytes, or a million terabytes. [Holy wow.] Spam is cheap, but reading has costs. How can text analysis help?

Why this explosion of information?
- growth in population and wealth: more money, more media toys
- multiple-media, from the photograph (1820s) to the iPod
- digitization of information and business practices: cheap creation, storage, reproduction, and transmission

Challenges to the system: what are the effects?
- experience of information overload
- multimedia shock
- narrowing expertise (because nobody can keep up with a broad discipline!)
- archive fever

What can we do?
- understand the problem (literary dimension to it; a problem of scale, a bibliographic problem)
- produce less? [shock! I can hear the internal gasps around the room!]
- file and (not) store smarter
- find smarter (not more) [ooh, I’ll quote him in my dissertation work! no, I cannot, in fact, process all the litcrit written to this day]
- learn to read differently

The latter two of the above are opportunities for text analysis.

Problem of scale to text analysis for finding and reading:
- heterogeneous formats and multimedia rich
- closed (“for perfectly reasonable reasons” -GR) information empires (Google) build on existing indexes or build their own
- new questions, research methods (data mining and visualization)
- text analysis tools developed for coherent texts (collaborate with data mining & HPC [high-performance computing] community)

TAPoR.2 model, Beyond Finding and Reading:
- gathering and aggregation function (working with existing empires like Google; create your own study library (myEmpire))
- mining function (clustering and classification; provoking questions, not finding)
- interface and visualization function (effective interactions for research)

DAY

They’re using the recipe metaphor to get people of different backgrounds to use TAPoR.

A recipe for self-guided instruction:
- ingredients
- steps
- glossary
- discussion
- further information

Ingredients:
- ingenuity
- a useful metaphor
- a versatile set of tools
- users desirous or willing to consider using said tools

Steps
- identify objective
- consider users’ needs
- develop case studies that describe how your tools can meet these needs
- apply a familiar metaphorical approach to engage and instruct
- deploy recipes through a wiki

Glossary
- recipe: a useful guiding metaphor that offers optimal flexibility…. [couldn’t get it, too fast]

Further Information:
Try the recipes out! (For example.)

Nice, familiar, easy concept. As Shawn is pointing out right now, super easy to engage a beginner user. This could be very useful, as well, when getting folks used to traditional humanities research methods to try, say, text encoding.

SINCLAIR

[Stéfan is the creator of HyperPo, the coolest text analysis tool ever so far.]

Generally, there’s a one-to-one mapping between tools and the data views of their results. SS has been thinking more in terms of this progression:

text -> tool -> data (TAML) -> style -> view

Among other things, he wanted to create a framework to use in teaching the development of text analysis tools in a modular way.

It’d also be nice to be able to chain tools together – you run a tool on a text, get the resultant data and feed it to another tool, and so on. This requires tools that can “talk” to each other, and output data in the same (or similar enough, or easily translatable) formats.
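The chaining idea, and the text -> tool -> data -> style -> view progression, can be sketched in Python. Nothing here is HyperPo or TAPoR code: the three tools are hypothetical, and plain dictionaries stand in for TAML, whose actual format these notes don't describe. The point is only that each tool accepts what the previous one emits.

```python
from collections import Counter


def tokenize(text):
    """Tool 1: raw text in, a list of lowercased tokens out."""
    return {"tokens": text.lower().split()}


def count_words(data):
    """Tool 2: tokens in, word counts out."""
    return {"counts": dict(Counter(data["tokens"]))}


def top_words(data, n=3):
    """Tool 3: counts in, the n most frequent words out (ties broken A-Z)."""
    ranked = sorted(data["counts"].items(), key=lambda kv: (-kv[1], kv[0]))
    return {"top": ranked[:n]}


def chain(text, tools):
    """Run the tools in order, feeding each one's output to the next."""
    data = text
    for tool in tools:
        data = tool(data)
    return data


def plain_text_view(data):
    """A 'style' applied to the data to produce a view: one word per line."""
    return "\n".join("{}: {}".format(word, count) for word, count in data["top"])


if __name__ == "__main__":
    result = chain("the cat and the hat", [tokenize, count_words, top_words])
    print(plain_text_view(result))
```

Because the view is a separate function from the tools, the same data could be styled as a table or a chart without rerunning the analysis, which breaks the one-to-one mapping between tool and view.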

HyperPo 7.0 is coming soon!

Arms on vast amounts of data.

Posted on at 13:16 by vika. Categories: digital humanities.

William Y. Arms is a computer scientist currently working at Cornell. The full title of his keynote is “Humanities and social science research using vast amounts of web data.”

Examples of very large collections:
- Library of Congress: National Digital Information Infrastructure and Preservation Program
- The Internet Archive’s historical collection of the web (600 TB, terabytes)
- Large scale digitization projects: Open Content Alliance, Project Gutenberg, Google, Microsoft, Yahoo, etc.
- USC Shoah Foundation: Survivors of the Shoah (400 TB)

How will humanities and social science scholars do research on collections which are large by supercomputing standards?

“Only the computer reads every word” –Greg Crane
- Researchers interact with the collections through computer programs that act as their agents.
- Users rarely view individual items except after preliminary screening by programs.
- Collection requires a highly technical computer system that is used by researchers who are not computing specialists.
- The collection is a high-performance computing system.
- Use of the collection depends on automated tools, which require state-of-the-art indexes for text and semi-structured data, natural language processing, and machine learning.

Example: the Cornell Web Lab (or is it a Library, asks Arms?)

The structure of text:
Manual analysis and mark-up
- skilled bibliographers and cataloguers
- manual textual markup
- semantic web tools for representing relationships (e.g., RDF, Fedora)
Semi-automated methods
- automated name recognition under human control (e.g., Perseus)
- expert-guided web crawling (e.g., iVia)

The above are tens of millions of records. How do we manage billions of records?

Example: The Internet Archive web collection
The data: complete crawls of the web, every two months since 1996, with some gaps:
- range of formats and depth of crawl have increased with time
- no data from sites that are protected by robots.txt or where owners have requested not to be archived
- some missing or lost data
- metadata contains format, links, anchor text
- organized to facilitate historical access to a known URL (Wayback Machine)

The research dialog between a scholar (S) and a computer scientist (CS) goes something like this:
S: Here’s a study we’d like to do…
CS: We don’t know how to do that analysis, but would this be any use to you?
S: Not as you suggest it, but here’s another idea…
CS: That might be possible, with the following modification…
BOTH: Let’s try it and see!

Eventually we get something that is both useful from a research point of view and feasible from a computing POV.

Social Science Research:
- the web as evidence of current social events (spread of urban legends; development of legal concepts across time)
- the web as social phenomenon (political campaigns, online retailing, polarization of opinions)

Research topic example: social and information networks, joining a community. Question: what is the probability an individual will adopt a new behavior, as a function of the number of his/her friends who are adopters? New behavior could be: adopting a new technology, joining a club, etc.
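
One way to make that question concrete, as a minimal sketch: given two snapshots of who has adopted, count how many non-adopters had k adopter friends and what fraction of them adopted by the second snapshot. The function name `adoption_curve`, the data layout, and the toy network are all assumptions made for illustration; the notes do not describe how the Web Lab actually computes this.

```python
from collections import defaultdict

def adoption_curve(friends, adopters_before, adopters_after):
    """Estimate P(adopt | k friends already adopted) from two snapshots.

    friends: dict mapping each person to the set of their friends
    adopters_before / adopters_after: sets of adopters at two points in time
    """
    exposed = defaultdict(int)    # k -> non-adopters who had k adopter friends
    converted = defaultdict(int)  # k -> how many of those adopted later
    for person, circle in friends.items():
        if person in adopters_before:
            continue  # already an adopter, so not a candidate
        k = len(circle & adopters_before)
        exposed[k] += 1
        if person in adopters_after:
            converted[k] += 1
    return {k: converted[k] / exposed[k] for k in sorted(exposed)}

# Toy network, invented for illustration.
friends = {
    "ann": {"bob", "cat"},
    "bob": {"ann", "cat", "dan"},
    "cat": {"ann", "bob"},
    "dan": {"bob", "eve"},
    "eve": {"dan"},
}
before = {"ann", "bob"}
after = {"ann", "bob", "cat", "dan"}
curve = adoption_curve(friends, before, after)
print(curve)
```

On web-scale data the same counting would run over billions of links, which is where the supercomputing-sized collection comes in.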

So, when everything is in digital form, will the library go from being the largest building on campus to being the largest computing system on campus? WA says there’s a good likelihood of that.

WA goes on to describe some of the projects on the Web Lab’s plate right now. Their descriptions can be found on the Web Lab site.

Policy issues on the use of the lab: custodianship of data; copyright; privacy.

Design guidelines for builders of large digital collections:
- every online collection or service needs an application program interface (API) for computers, not humans, to interact with the library.
- a primary methodology is: select a subset of the collection; download to researcher’s computer; use programs on the researcher’s computer to analyze the data.
- almost all metadata will be computer generated, but human cooperative editing can correct errors.

Pytlik Zillig on TokenX

Posted on at 11:17 by vika. Categories: digital humanities.

Brian Pytlik Zillig is an all-around digital-library tech wizard at the University of Nebraska-Lincoln (UNL), which hosted the first annual Digital Humanities Workshop a few weeks ago. The full title of his paper is “TokenX: a text visualization, analysis, and play tool designed for the XML document tree.”

Some history:
- CDRH and other digital centers significantly rely upon XML
- XML is a 1998 [whoa, old] recommendation [hunh, not a standard] of the W3C
- XML is a robust and flexible medium for content
- UNL has been using XML/SGML since 1998
- all CDRH projects use XML

Research question, born in 2004:
- can emerging standards assist in text visualization, analysis and play? (for example: XSLT)

BPZ’s goal:
- use XSLT to explore text visualization, analysis, and play (TVAP)
- provide TVAP options useful to facilitate the creative, qualitative, and quantitative exploration of XML text

Why another text analysis tool?
- there are good tools available, written in a variety of languages, but none is written in XSLT, and none takes advantage of the special relationship between XML and XSLT

Say we take a Shakespearean sonnet line: “when to the sessions of sweet silent thought.” You’d be crazy to try to mark up every word in XML by hand; it’d be a huge undertaking. But XSLT 2.0 can add markup to words using tokenization! Way cool! Tokenized, each word will look like this: <w>word</w> – and here’s a punctuation mark: <nonWord>,</nonWord>
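
A rough sketch of the tokenizing idea, in Python rather than XSLT, so this is not TokenX's code. Only the <w> and <nonWord> element names come from the talk; the regular expression and the function name `tokenize_to_xml` are assumptions made for illustration, and unlike TokenX this toy works on plain text and does not preserve existing markup.

```python
import re
from xml.sax.saxutils import escape

# A word is a run of word characters (allowing internal apostrophes);
# any other non-space character is treated as a non-word token.
TOKEN = re.compile(r"\w+(?:'\w+)*|[^\w\s]")

def tokenize_to_xml(text):
    parts = []
    for tok in TOKEN.findall(text):
        tag = "w" if tok[0].isalnum() else "nonWord"
        # escape() protects characters such as & and < inside the element
        parts.append(f"<{tag}>{escape(tok)}</{tag}>")
    return " ".join(parts)

line = "When to the sessions of sweet silent thought,"
print(tokenize_to_xml(line))
```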

With this markup, XSLT can be used to do a variety of TVAP actions on a text. TokenX ingests XML documents, retaining the original markup, and adds tokens like the examples above. Visualizations include word highlighting, keywords in context (looks a bit like a concordance), replacing words with blocks (for example, to find words that are too long?), highlighting punctuation and non-words, all kinds of stuff.
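
To give a feel for the keywords-in-context view, here is a small Python sketch. It is an illustration under assumptions, not how TokenX does it: the function name `kwic`, the window width, and the second sonnet line used as sample text are all choices made for this example.

```python
import re

def kwic(text, keyword, width=3):
    """Return (left context, keyword, right context) for each occurrence."""
    words = re.findall(r"\w+(?:'\w+)*", text.lower())
    rows = []
    for i, w in enumerate(words):
        if w == keyword.lower():
            left = " ".join(words[max(0, i - width):i])
            right = " ".join(words[i + 1:i + 1 + width])
            rows.append((left, w, right))
    return rows

text = ("When to the sessions of sweet silent thought "
        "I summon up remembrance of things past")
for left, kw, right in kwic(text, "of"):
    # Right-align the left context so the keywords line up in a column.
    print(f"{left:>25} [{kw}] {right}")
```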

Here’s the TokenX site, if you’d like to play with it.

Analyze, in the TokenX context, means:
- count words in context
- decontextualize words and count them (frex, list all the words in the document alphabetically, or by frequency, each word only once with a number of its occurrences next to it)
- word statistics (how many words, how many elements containing words, mean number of words per element)
- punctuation and non-word statistics
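
The counting operations above are easy to sketch in Python. This is an illustration only: the function name `word_stats` and the idea of passing each element's text as a plain string are assumptions, and TokenX itself does this in XSLT over the tokenized XML.

```python
import re
from collections import Counter

def word_stats(elements):
    """elements: list of strings, one per text-bearing XML element."""
    per_element = [re.findall(r"\w+(?:'\w+)*", e.lower()) for e in elements]
    per_element = [ws for ws in per_element if ws]  # keep elements containing words
    counts = Counter(w for ws in per_element for w in ws)
    total = sum(counts.values())
    return {
        "total_words": total,
        "elements_with_words": len(per_element),
        "mean_words_per_element": total / len(per_element) if per_element else 0.0,
        "by_frequency": counts.most_common(),   # each word once, with its count
        "alphabetical": sorted(counts.items()),
    }

lines = ["When to the sessions of sweet silent thought",
         "I summon up remembrance of things past,"]
stats = word_stats(lines)
print(stats["total_words"], stats["by_frequency"][:3])
```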

TokenX exports into spreadsheets, so you can export and save your dataset.

You can play with TokenX:
- substitute words
- replace words with images

Best part: it’s free and open-source. “You can change it!” Brian exclaims. Excellent.

Wulfman on the Modernist Journals Project

Posted on at 10:46 by vika. Categories: digital humanities.

Cliff Wulfman is working at Brown – lucky us! (Major shout-out to Cliff.) The full title of his paper is “The Modernist Journals Project: A new architecture.”

Here’s the MJP site. It’s evolved from a quite small-scale faculty project. Cliff talks about how to take one of those and move it toward technologies that will allow it to grow and expand and move at a healthy pace. Their primary-source set is pretty large: modernism grew up largely in periodicals, and they’re digitizing them and putting them online.

Complete runs of magazines are scarce, CW says. Even when they exist, oftentimes the advertising has been stripped.

The MJP started out with a desktop scanner and an OCR (optical-character-recognition) package. One office, one faculty member, several students. That’s all. They tried to digitize all 30 volumes – nearly 18,000 pages! – of The New Age (”a weekly review of politics, literature, and art”) that way.

The limits of the original implementation: it was labor-intensive, hand-scanned and hand-coded; and it was served through eclectic, hand-made HTML pages. The MJP was outgrowing the pot in which it was seeded, CW says. It needed:

- engagement with the concept of “cyberinfrastructure”
- embrace of new technologies, standards, best practices that weren’t in place when the project was first conceived.

So they stepped back and devised a new architecture:
- complex digital objects based on digital library standards (METS, MODS, MADS)
- XML substrate
- data- [?] and database- driven service
- polymorphous delivery: can deliver in formats other than PDF

We then had a demo. Go look at the site for more. :)

Future directions:
- access to new scanner technologies will enable vast collection growth
- developing an interlinked encyclopedia of modernism
- build on Fedora’s digital library infrastructure

Hirtle on TRANSLATOR.

Posted on at 10:25 by vika. Categories: digital humanities.

David Hirtle is doing graduate work here at the University of New Brunswick, in computer science. The full title of his paper is “TRANSLATOR: a TRANSlator from LAnguage TO Rules.”

Semantic web is still not widely used.
- Focus of current development: machine-readable (meta)data
- Problem: only experts can contribute. Need to lower barrier to entry.

Provide a user-friendly format!
- why not English [he really means natural language]?
- “controlled English” avoids ambiguity: it’s formal, but also natural

TRANSLATOR will translate “every student gets a discount of 15 percent” to express [in XML, from what I see] that “student” implies “customer,” etc.

ACE (Attempto Controlled English):
- looks like English: “every honest student who does not procrastinate receives a good mark and easily passes the course.”
- but actually a formal language, like RDF: a tractable subset of English – all ACE sentences are English, but not vice versa
- every ACE sentence can be unambiguously translated into logic.

Strategies for handling ambiguity:
- exclude imprecise phrasings (”students hate annoying professors” – do they hate to annoy profs, or do they hate profs who are annoying?)
- interpretation rules (“the student brings a friend who is an alumnus and receives a discount” – who receives the discount? In ACE, by default, it’s the student because of a certain rule. If you want it to be the alumnus, you write “…and who receives a discount.”)

How can rules be expressed?
- in natural language, many different forms (everyone is mortal, all humanity is mortal, for each person the person is mortal)
- all above are valid ACE
- further embellishment (negation, relative clauses, etc) [vz: but doesn’t that add ambiguity?]

What can’t yet be easily expressed?
- “infix” implication (”the student is happy if there is no class” – solution: TRANSLATOR swaps the condition(s) and conclusion(s) and voila, ACE-acceptable)
- production and reaction rules (involve actions: “if a student is caught cheating then send a report to the registrar” requires the imperative mood, which is not yet in ACE)
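
The infix swap is simple enough to mimic with a string rewrite. The Python below is a toy sketch, not TRANSLATOR's implementation (which the notes do not show): the function name `prefix_implication` and the regular expression are assumptions, and it only handles a single "X if Y" pattern.

```python
import re

def prefix_implication(sentence):
    """Rewrite 'CONCLUSION if CONDITION.' as 'If CONDITION then CONCLUSION.'"""
    body = sentence.strip().rstrip(".")
    if body.lower().startswith("if "):
        return sentence  # already in if/then order
    match = re.match(r"^(?P<conclusion>.+?)\s+if\s+(?P<condition>.+)$", body,
                     flags=re.IGNORECASE)
    if match is None:
        return sentence  # no infix 'if' found, nothing to swap
    conclusion = match.group("conclusion")
    # Naive: lowercases the first letter, which would be wrong for proper nouns.
    conclusion = conclusion[0].lower() + conclusion[1:]
    return f"If {match.group('condition')} then {conclusion}."

print(prefix_implication("The student is happy if there is no class."))
```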

Discourse representation structures, and more technical info. Sad, I can’t reproduce his diagrams here. The rules are eventually translated into RuleML, in whose development David is participating.

RuleML:
- goal is interoperable rule markup (XSLT translators to other semantic web languages)
- family of “sublanguages” (modular XML schemas; each represents a well-known rule system; TRANSLATOR uses First-Order Logic sublanguage)

Why use RuleML?
- ease of interchange (XML)
- compatibility with RDF and other languages, as well as W3C’s upcoming Rule Interchange Format
- availability of tools
- wide variety of features (negation-as-failure, weightings, data types, etc.)

Again, work-in-progress. Truly an attempt at getting closer to the semantic web. Formalizing natural language, what a gargantuan task. One critical benefit of TRANSLATOR is that it “allows non-experts to write facts and rules for the semantic web.” When can we play with it?

Now, it appears. Here’s a site for TRANSLATOR, including a Java Web Start demo.

Lukon and Juola on building an index generator.

Posted on at 10:00 by vika. Categories: digital humanities.

Shelly Lukon and Patrick Juola are both at Duquesne University. The full title of their paper (presented by Lukon) is “Designing a context-sensitive machine-aided index generator.”

Problem definition.

Back-of-the-book indexing provides relevant terms, identifies cross-references and subcategories, and has a static, rigid structure (as opposed to web indexing). Human indexers invest a LOT of time into indexing (1 week per 100 pages of text); use software to automate mundane tasks; and make all the intelligent indexing decisions. SL&PJ’s prototype system aims to bridge the gap between the human and currently available tools, not to replace the human indexers.

They’ve interviewed professional indexers, product-tested some of the software packages they tend to use, and looked at some mathematical techniques (particularly LSA, latent semantic analysis) that have had proven success in text processing and capturing semantic content of terms.

Cognitive tasks involved in index construction:
- identify terms to index;
- locate all informative references;
- identify/locate synonymous terms;
- split index terms into subterms;
- develop cross-references within text;
- compile page numbers.

Their techniques for obtaining semantic information:
- parsing/tagging of terms, frequency analysis
- LSA
- word sense disambiguation (WSD)
- hierarchical cluster analysis (HCA)

This is still a work in progress. So far they’ve been able to locate all informative terms in text, and to allow the user to set thresholds/parameters. LSA, WSD and HCA show first level of clustering nearly 40% accurate upon inspection (not great, but a solid start). Their single-processor PC takes several hours to process small (60K words) corpora. Better than a human’s speed!

They’re categorizing words into parts of speech: identify the part-of-speech of each term; label each term with delimiter and acronym (home becomes home/NN since home is a noun). They’re only dealing with English right now. Their app is written in Java, as is MontyLingua which they’re using for part-of-speech tagging.
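
The term/TAG labeling format looks like this in a toy Python sketch. The tiny lexicon is hand-made and purely hypothetical; a real tagger such as the one they use works out tags from context rather than looking them up, and the function name `label_terms` is an assumption.

```python
# Hand-made toy lexicon using Penn Treebank style tags (DT = determiner,
# NN = noun, VBZ = verb 3rd person singular, JJ = adjective).
TOY_LEXICON = {"the": "DT", "home": "NN", "index": "NN", "is": "VBZ", "big": "JJ"}

def label_terms(tokens, lexicon=TOY_LEXICON, default="NN"):
    """Attach a part-of-speech tag to each token as token/TAG."""
    return [f"{t}/{lexicon.get(t.lower(), default)}" for t in tokens]

print(label_terms(["The", "home", "is", "big"]))
```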

LSA:
- use factor analysis to generate numerical representations of terms and their meanings;
- divide corpus into “documents” (paragraphs), then analyze each unique “term” (word) relative to each document;
- create term-by-document matrix;
- create term-by-term covariance matrix (look at how each pair of terms varies together)
- singular value decomposition (SVD) - a way of explaining variability among random variables (dimensions)
- decompose covariance matrix into three submatrices [over my head here]
- rank resulting values
- reconstruct using most significant dimensions (reduce noise, sharpen similarities/contrasts)
- 200 most significant dimensions: pinpoint each term’s location in 200-dimension “semantic space” [why 200?]
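
The first steps of that pipeline (term-by-document matrix, then term-by-term covariance) can be sketched in plain Python. This is a simplified illustration under assumptions: the three one-line "paragraphs" are invented, the function names are hypothetical, and the SVD and dimension-reduction steps are left out because they would normally be handed to a numerical library (for example NumPy's `numpy.linalg.svd`).

```python
def term_document_matrix(documents):
    """Rows are terms (sorted), columns are documents, cells are raw counts."""
    vocab = sorted({w for doc in documents for w in doc.lower().split()})
    rows = [[doc.lower().split().count(term) for doc in documents]
            for term in vocab]
    return vocab, rows

def covariance_matrix(rows):
    """Sample covariance between every pair of term rows."""
    n_docs = len(rows[0])
    centered = [[x - sum(r) / n_docs for x in r] for r in rows]
    return [[sum(a * b for a, b in zip(ri, rj)) / (n_docs - 1)
             for rj in centered]
            for ri in centered]

# Toy corpus: each string stands in for one paragraph.
paragraphs = ["bass guitar music", "bass boat fish", "guitar music band"]
vocab, tdm = term_document_matrix(paragraphs)
cov = covariance_matrix(tdm)
g, m, b = vocab.index("guitar"), vocab.index("music"), vocab.index("boat")
print(cov[g][m], cov[g][b])  # positive: vary together; negative: vary apart
```

Terms that show up in the same paragraphs (guitar, music) get a positive covariance, and terms that avoid each other (guitar, boat) get a negative one. That pattern is the raw material the decomposition step works on.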

WSD
- separate out different senses (meanings) of each term token
- numerical encodings generated by LSA give average context for each term token
- look at encodings of the other terms surrounding each occurrence of the token
- Example: the word “bass” occurs throughout text (both as fish and as musical instrument), proximate to other words (guitar, boat, fish) that help disambiguate
- disambiguate “bass” into “bass_fish” and “bass_instrument”
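
Here is a deliberately crude Python sketch of the "bass" example. It swaps in hand-made cue-word sets where the real system uses the LSA encodings of neighboring terms, so treat it as an illustration of the idea only; `SENSE_CUES`, the window size, and the sample sentence are all assumptions.

```python
# Hypothetical cue words per sense; the real system derives context
# from LSA vectors rather than from fixed word lists.
SENSE_CUES = {
    "bass_fish": {"boat", "fish", "lake", "caught"},
    "bass_instrument": {"guitar", "band", "played", "music"},
}

def disambiguate(tokens, target="bass", window=3, cues=SENSE_CUES):
    """Replace each occurrence of target with its best-matching sense label."""
    out = []
    for i, tok in enumerate(tokens):
        if tok != target:
            out.append(tok)
            continue
        # Look at up to `window` tokens on each side of this occurrence.
        context = set(tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window])
        scores = {sense: len(context & words) for sense, words in cues.items()}
        best = max(scores, key=scores.get)
        out.append(best if scores[best] > 0 else tok)  # leave as-is if no cue matches
    return out

tokens = "he played bass in a band then caught a bass from the boat".split()
result = disambiguate(tokens)
print(result)
```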

HCA
- partition terms into subsets with similar properties/characteristics
- antonyms as well as synonyms will cluster together (both have strong relationships, but the system doesn’t know whether they’re positive or negative)
- this information can be used to identify cross-refs (see also) and subterms
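
A minimal sketch of agglomerative clustering in Python, to show how related terms end up grouped. The two-dimensional vectors are invented stand-ins for positions in semantic space, and single linkage with cosine distance is an assumption; the notes do not say which linkage or distance the authors use.

```python
import math

def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

def agglomerate(vectors, n_clusters):
    """Single-linkage agglomerative clustering over a {term: vector} dict."""
    clusters = [[term] for term in sorted(vectors)]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(cosine_distance(vectors[a], vectors[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = sorted(clusters[i] + clusters[j])
        del clusters[j]
    return sorted(clusters)

# Invented vectors: "hot" and "cold" sit close together even though
# they are antonyms, because they occur in similar contexts.
vectors = {
    "hot": [0.9, 0.1], "cold": [0.8, 0.2],
    "guitar": [0.1, 0.9], "bass": [0.2, 0.8],
}
print(agglomerate(vectors, 2))
```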

This is a machine-aided system. Its purpose is not to replace but to assist the human indexer, whose judgment and experience cannot be fully captured by a sophisticated expert system. Users can edit results at any stage, control indexing parameters, etc.

Metrics for evaluating the “goodness” of the resulting index:
- side-by-side comparison between entirely-human-generated and machine-aided indexes of the same dataset, quantify what percentage of agreement is acceptable, maybe find meaningful information in how they disagree as well.
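
One possible way to put numbers on that comparison, sketched in Python. Treating each (term, page) pair as an index entry and reporting precision and recall is an assumption made here, not a metric the speakers named; the two toy indexes are invented.

```python
def index_agreement(human_index, machine_index):
    """Each index maps a term to the set of page numbers listed for it."""
    human_entries = {(t, p) for t, pages in human_index.items() for p in pages}
    machine_entries = {(t, p) for t, pages in machine_index.items() for p in pages}
    shared = human_entries & machine_entries
    return {
        # Share of machine entries the human also chose.
        "precision": len(shared) / len(machine_entries) if machine_entries else 0.0,
        # Share of human entries the machine also found.
        "recall": len(shared) / len(human_entries) if human_entries else 0.0,
        # The disagreements, which may be informative in themselves.
        "only_human": sorted(human_entries - machine_entries),
        "only_machine": sorted(machine_entries - human_entries),
    }

human = {"bass": {3, 7}, "guitar": {3}, "indexing": {1, 2}}
machine = {"bass": {3, 7}, "guitar": {3, 9}, "boat": {7}}
report = index_agreement(human, machine)
print(report)
```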

Future work:
- incremental refinement
- system has modular architecture for ease of swapping out individual components
- need robust, effective user interface
- empirically vary frequency thresholds, weighting methods, number/percentage of dimensions to use in the reduced data matrix
- continue to build in the latest/most efficient indexing/retrieval methods.

What a great project. I’d love to use it for RolandHT, but it probably won’t be done in time. Enabling the software to read/process XML is on their wish list of big enhancements, hooray!