jump start

Posted on 30 May 2008 at 0:47 by vika. Categories: blogging, burning man, community, digital humanities, digital library, love the world, people, self, travel, work.

Been a while since I’ve blogged publicly, hasn’t it? Hello, again.

I go to write this post, and notice a new comment from Regina, an old friend from Moldova who now lives in Israel, with whom I’d fallen out of touch a while ago. Holy cats. Hello, again. It’s lovely to hear from you.

(The timing of the comment and of my being compelled to write here again are a coincidence.)

Yeah, there’s been a lot of sadness that I’m not quite ready to write down. Luckily, the last month or so has also been filled with joy and light and smart people and work (hooray, work!), so it’s not like there’s nothing to tell.

My job at Boston University, the title of which has now settled at Digital Collections and Computing Support Librarian [in the School of Theology], rocks my socks so far. It’s not that I’ve done a whole lot, yet; it’s only been a month, and the end of the academic year at that, and my boss the head librarian has been out on vacation for the past two weeks, so things are relatively slow. On the other hand, there’s plenty to do in the computing-support half of the job. I’ve been learning [more] about how BU’s network is set up, which is nifty. We’re purchasing a big pile of equipment to replace old stuff – both servers and personal workstations for faculty and staff – which, you know, from the support standpoint is great. Soon there’ll be no more @$#%! five-year-old Dells to support, and many of the four-year-old machines are going away too. People are open to the idea of Macs, which is huge in such a behemoth mostly-Windows org. (BU is an immense bureaucratic machine, and I say that with all the affection that one would expect a girl to have for her alma mater.)

Best of all, people want to learn. I’ve been getting to know the faculty and staff. Some of them are already doing digital humanities projects (like the History of Missiology site). Others have cool ideas (hello, Admissions Director using Facebook in all kinds of cool community-building ways). And still others want to figure out how computing can make their research and teaching (and administration, and the school as a community) more awesome.

This is what they hired me to work on. I’m unspeakably excited. Yeah, so far it’s been all support and no digilib, but I expect that to change. There’s a lot of hardware overhauling to do, and some basics to catch up on. That will take some months. But there’s already so much concrete investment of time, thought and resources in digital library stuff at STH that I have no doubt it’s going to go somewhere interesting.

Then there’s life outside of work. That’s been filled with friends, children, loved ones, cats, cooking, Burning Man planning, hand drumming, sci-fi reading, Battlestar Galactica, water and fire and earth, casual photography, breathing deeply. And the weather’s been nice.

Yesterday I flew to DC. Today I participated in a day-long grant proposal review panel for which I read a total of thirty proposals, which took an unreal amount of time and was fascinating and instructive, and I’m not being sarcastic about any of that. The panel itself was great too; in the past month or so I’ve learned a ton about the grant review and award process, and I fully intend to use this knowledge for good. I have generalized thoughts on the whole thing, but have to formulate them separately – must wrap my brain around the whole thing first, and also make sure not to cross any confidentiality boundaries. The whole thing made me feel awfully important, and going away for just over 24 hours meant I could travel with just my work bag, light and easy.

Coming back tonight, at the Reagan Airport, I texted a friend something to the effect of, I like traveling – the interstitial part, the going – even more than being places. She laughed and declared me liminal girl. Certainly that holds true for my life in a larger sense.

There’s more, always – the children I get to hang out with, the surprisingly strong presence of love in my days, feeling so strong from weightlifting with one of my dearest, the USB turntable I bought with which I’m digitizing records from the old country – but it’s 1:45am, and tomorrow’s a workday. Er, today. Whatever.

changes in quotidia

Posted on 22 April 2008 at 5:54 by vika. Categories: digital humanities, digital library, self, work.

In an hour and a half or so I leave for the T (public transport!! don’t have to drive!) to start my new JOB.

Why, yes, I’m pretty thrilled at the prospect. The title keeps changing in the various documents they’ve sent, but the most recent one is Digital Librarian / Computer Support Manager at the Boston U School of Theology Library. The potential range of what my work days will involve is too large in my head right now, which means life will totally fail to be boring, and I’m likely to get flooded by information overload for the first few weeks, and I am so looking forward to that. I’ll tell y’all more when I get oriented. But… it’s an exciting job in my field. Holy cats.

Thus ends eleven months of unemployment.

On another note: praised be the sun and the moon and cycles, and spring. Did you know they’re saying 71 degree high today in Somerville?! And – just looked this up: there may or may not be a 5-degree difference between my house and my workplace at any given time. Well, of course: Commonwealth Ave is a wind tunnel, and work is also closer to the big water. Good to know.

MLA ‘07: Friday (1 of 2)

Posted on 31 December 2007 at 16:01 by vika. Categories: digital humanities.

Oh, Friday! Friday was a big day for electronic literature and the digital humanities (see this earlier post). A great session to go to was “New Reading Interfaces,” presided over by Rita Raley who knows how to get a discussion going. Here are some cool projects and topics discussed in this session.

Jeremy Douglass talked about tag clouds as an aesthetic medium. They are web browsing interfaces, and despite their name they’re usually organized alphabetically or by popularity. Douglass takes the idea of the cloud and runs with it, exploring tag clouds as a creative medium: the example he gave of this was the Flickr Fiesta 2005 invite he received by email. Algorithmically sophisticated literary renderings like TextArc, where terms have geographic meaning, may look like tag clouds, but the latter are much simpler, Douglass said; plus, TextArc isn’t searchable, whereas tag clouds are. Later in the session, he brought home the broad(er) point that the tag cloud isn’t just a utilitarian interface; it can be portraiture, for example when some blogs replace their mastheads with tag clouds.

Then Joseph Tabbi talked about the semantic literary web, mostly in the context of the ELO Archive-It MediaWiki, a joint project with the Library of Congress. How do you preserve something, Tabbi asked? Well, you can tag it, which is limited but useful as a field-building (as opposed to literary) activity. OK, so what counts as a literary interface? Clouds are interesting as conceptual art, but their literariness (found through reading) is limited. Tabbi talked about Electronic Book Review (ebr) as an example of experiments in literary interfaces: the ebr website gets completely overhauled every couple of years, sometimes with results that are sub-optimal for readability. The key, for Tabbi, is to find conceptual connections while reading, and cross-link, cross-categorize – to writing both within and outside electronicbookreview.com.

Elizabeth Swanstorm talked about the interface in Jeffrey Shaw’s installation piece The Legible City. This is one I would travel overseas to play with, given more time and financial resources. Here’s how Shaw himself describes it:

In The Legible City the visitor is able to ride a stationary bicycle through a simulated representation of a city that is constituted by computer-generated three-dimensional letters that form words and sentences along the sides of the streets. Using the ground plans of actual cities - Manhattan, Amsterdam and Karlsruhe - the existing architecture of these cities is completely replaced by textual formations written and compiled by Dirk Groeneveld. Travelling through these cities of words is consequently a journey of reading; choosing the path one takes is a choice of texts as well as their spontaneous juxtapositions and conjunctions of meaning.

Better: the latest installation is multiplayer! If people are using more than one stationary bike, they may encounter each other’s avatars in the virtual world. So each rider is a node in a distributed networked system; their actions influence others’ virtual world; and their physical surroundings, irrelevant, fall away. So what kind of interactor, Swanstorm asked, does The Legible City produce – readers, riders, writers? Her eventual thesis was that this project highlights textual analysis as something one does by actively interacting with the text. No kidding; imagine giving undergraduate students of literary writing and/or criticism the visceral experience of this installation. They’d have a different relationship with literature forevermore.

Finally, Victoria Szabo talked about teaching, reading and creating scholarly works in 3D environments. Specifically, she talked about how they (Information Science + Information Studies at Duke) use Second Life in teaching. Students create objects, hold events and collaborate on criticism virtually… oh, just look at the ISIS site I just linked to. Szabo’s overarching point was this architectural metaphor: building and thinking are closely related. They put this into successful ongoing practice over at ISIS, encouraging students to combine creative and critical acts in their use of 3D virtual worlds.

MLA ‘07: Thursday cont.

Posted on at 15:09 by vika. Categories: digital humanities.

Been home for 24 hours now, and I realize that I didn’t finish writing up the exciting stuff I saw at MLA on Thursday. So:

1. NINES, “a networked infrastructure for nineteenth-century electronic scholarship,” continues to impress with its impact and exemplary use of the net for collaboration. It arose, Laura Mandell said in her talk, in reaction to the prejudice against electronic publishing among tenure review, faculty search and other profession-influencing committees. The NINES editorial board not only aims to separate high-quality electronic scholarship from the chaff, but also to do so in a sustainable manner. To that end, from what I understand, they review sites and projects but leave things like copy-editing to authors themselves, ideally aided by their own institutions.

Laura’s point that the digital resources don’t, and can’t, disguise the human agency that creates them is worth repeating every once in a while. One of the ways in which electronic scholarship has been good for the humanities is that computation forces us to admit we’re constantly making choices, and some of these choices are arbitrary in that equally valid options exist for many editorial decisions. Objectivity as an aim falls away when you’re working computationally, and what’s left is a need to clearly explain your decisions. As we know from so many spheres of life, transparency is key to communication. Scholarly communication is no exception to that.

2. In the same session, Robert Blake talked about the UC Language Consortium, which totally blew me away even if their site has been down for a few days now. They’re developing online resources for the teaching of foreign languages, starting with impressive projects in Filipino and Arabic. The consortium solicits proposals for development of these resources, and gives out small ($5,000-20,000) grants. The courses for which these resources are developed proceed to be open – for credit and all – to all students within the UC system, and the online materials are open to anyone to look at. Now that’s open courseware. And their next big project is Punjabi Without Walls! Apparently the Punjabi communities in the U.S. (and presumably elsewhere) are excited about this, since they want to keep their language alive and these materials will make that easier.

On to MLA Friday in the next post.

MLA ‘07: Thursday

Posted on 29 December 2007 at 14:29 by vika. Categories: digital humanities.

Highlights from Thursday:

The first session I went to was “The Challenge of a Million Books.” The title refers to computational mining of huge amounts of text at a time, in an attempt to discover bird’s-eye-view-level things we’d have trouble seeing with the naked eye. I discovered at this session that text mining is also called knowledge discovery. The latter is a term a bit too generic, I think: my encoded Roland excerpts also permit, even encourage, knowledge discovery, but what I’ve done with manual encoding and a simple interface is a far cry from sophisticated algorithms and machine learning.

Sara Steger presented on her research of sentimentality in nineteenth-century literature. This doctoral dissertation work is one of the test cases for the MONK project (Metadata Offer New Knowledge), one of the coolest collaborative endeavors currently out there. Simply put by the project creators themselves, MONK “is a digital environment designed to help humanities scholars discover and analyze patterns in the texts they study.” Sara took a bunch of mid-19th-century English texts, designated some chapters as sentimental (she brought up Little Nell’s death scene from Dickens’ The Old Curiosity Shop as an example), and other chapters as unsentimental. She used these as the training set for the MONK algorithm, “asking” it to figure out more or less on its own what makes a text sentimental or unsentimental, and then having new chapters automatically classified. Statistical analysis then revealed some interesting things: some words are clearly associated with sentimentality (having to do with the female gender, or children, or death, or love), while others are just the opposite (including titles such as Mr./Mrs., and business- and law-related words). Sara’s theory is that this means sentimentality is not just there when we “feel it.” It’s at least in part a formula, used by 19th-century writers to political ends. Her research is still in progress, but is already producing quite cool results.
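For the curious, here is a minimal sketch of what "train on labeled chapters, then classify new ones" can look like. It is not MONK's code: the talk didn't specify the algorithm, so I'm assuming a plain naive Bayes classifier with add-one smoothing as a stand-in, and the function names and toy "chapters" are invented for illustration.

```python
import math
from collections import Counter

def train(labeled_chapters):
    """Count word occurrences per label from (text, label) pairs."""
    word_counts = {}
    chapter_counts = Counter()
    for text, label in labeled_chapters:
        chapter_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.lower().split())
    return word_counts, chapter_counts

def classify(text, word_counts, chapter_counts):
    """Return the label with the highest log-probability (add-one smoothing)."""
    vocabulary = set()
    for counts in word_counts.values():
        vocabulary.update(counts)
    total_chapters = sum(chapter_counts.values())
    best_label, best_score = None, float("-inf")
    for label, counts in word_counts.items():
        # Prior: share of training chapters carrying this label.
        score = math.log(chapter_counts[label] / total_chapters)
        denominator = sum(counts.values()) + len(vocabulary)
        for word in text.lower().split():
            score += math.log((counts[word] + 1) / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy training set: invented snippets, not texts from Sara's corpus.
training = [
    ("the dear child wept beside her dying mother", "sentimental"),
    ("she kissed the little girl with tender love", "sentimental"),
    ("mr brown signed the contract at the bank", "unsentimental"),
    ("the lawyer and mrs grey settled the business account", "unsentimental"),
]
word_counts, chapter_counts = train(training)
label = classify("the little child wept for love", word_counts, chapter_counts)
```

The per-label word counts are also where the "which words go with sentimentality" observations would come from: words that are common under one label and rare under the other do most of the classifying.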

The other cool URL I gathered from the session is SEASR (pronounced Caesar), Software Environment for the Advancement of Scholarly Research. This project works in tandem with MONK, and seems to aim for “construct[ing] data services that access and normalize unstructured information.” It looks as though the final product will be available not only to large projects but to individual scholars as well; exciting.

Later that evening John Unsworth spoke on “Cyberinfrastructure and Open Standards, Methods, and Communities.” As usual with Unsworth’s dense and whirlwind talks, I quickly gave up on taking notes. Luckily, the entire talk is online, albeit a bit difficult to read without margins. But copy-paste, print it out even, read it: this powerhouse of digital humanities always impresses with his ability to synthesize large, important topics in an accessible way.

MLA ‘07: an unexpected rush

Posted on at 9:00 by vika. Categories: digital humanities, rolandht.

Chicago-town has strange weather. I got in on Thursday to a dry, near-freezing city very similar to Boston – but yesterday there was a wet-snow storm that was gorgeous, diagonal and swirling, out the huge hotel windows but left almost no snow on the ground. That’s lake effect for you.

I’m here for the annual convention of the Modern Language Association. MLA is an odd beast. With about ten thousand attendees a year, I’m pretty sure it’s the largest humanities conference in North America. (I’d be curious to find out that I’m wrong! If you know of a larger one, tell me.) It’s of necessity impersonal, and filled with stressed-out people interviewing for jobs, sitting in one committee meeting after another, taking every advantage of being in the same town as far-away colleagues to cram in as much geeking-out about their favorite geeky topics as they can, losing sleep in the process.

OK, that last part is true of any academic conference. But still, MLA isn’t generally thought of as an exactly enjoyable event.

This year, though, the organizers seem to have gone all out in promoting digital humanities sessions. The poster/demo session I was in, “Electronic Literature: Reading, Writing, Navigating,” was mentioned in the Winter 2007 newsletter – a big deal, considering the thing goes out to 30K members. The result was a rush: the hour-and-fifteen-minute session was packed with people, and I didn’t get to see my colleagues’ work until the very end because pesky people were coming up and being all interested in RolandHT (poster, 1MB, and teaching modules, 31K, both PDF files).

I loved every minute, of course. The whole thing left me flyin’, feeling much like I do at Digital Humanities conferences. This was both unusual in the context of MLA, and a welcome respite from the past few months’ job search both in and out of academe. So, if you’re reading this and were there: thank you! If you have any further thoughts on the project, please comment here or email me, username vika at this domain.

I’ll post a few session notes later on. For now, breakfast.

Purple Blurb

Posted on 16 September 2007 at 9:35 by vika. Categories: art, digital humanities, rolandht.

This coming Tuesday, September 18th, come to MIT for the first in the Purple Blurb digital reading series. “The readings will start at 6pm at MIT in 14N–233 (second floor of building 14, in the wing that is across the courtyard from the Hayden Library),” says organizer Nick Montfort in the announcement.

The first reader will be Robert Kendall, and I’m very sorry to miss it due to a prior obligation: Rob’s words tend to transport me somewhere familiar I’ve never been before. At the next event on October 18th, I’ll be reading from RolandHT and talking a bit about narrative threads running through it. The other two readings this semester will take place on November 13th (Barbara Barry) and December 4th (Andrew Plotkin).

For a good time, call on Purple Blurb.

MITH puts up podcasts.

Posted on 10 December 2006 at 15:59 by vika. Categories: digital humanities.

The Maryland Institute for Technology in the Humanities, which hosts a series of informal Digital Dialogues, has put up podcasts of the talks that have already taken place this semester. One of them’s my November talk about the Virtual Humanities Lab.

CaSTA: the closing.

Posted on 14 October 2006 at 11:54 by vika. Categories: digital humanities.

Whew, that was grand. Just one thing about the closing panel discussion, while it’s fresh in my mind.

This year’s CaSTA was billed as “a joint computer science and humanities computing conference.” And it was! And [we saw that] it was good. Of the five keynote speakers, three were humanists and two – computer scientists. The final discussion was called “Humanities Computing Science??”.

William Arms, in his remarks during the panel, said that during the conference a word was frequently used that isn’t generally used in his usual [computer-science] circles. That word – knowledge. He, and just about everyone at the panel, said that what they primarily want from the “other side” is dialogue.

In light of that, what I’d like to see in this continuing dialogue is a bit of discussion of the word science. As it’s been used lately (in the last, what, 200 years?), it implies “HARD.” Humanities implies “soft.” That’s a major point of contention.

But given that “science” pretty much means “knowledge,” should we revisit our use of the word?

Siemens on REKn

Posted on at 10:30 by vika. Categories: digital humanities.

Ray Siemens is a computing humanist and Renaissance scholar working at the University of Victoria. The full title of his paper is “Knowledge management and textual cultures? Work toward the Renaissance English Knowledgebase (REKn, pron. “reckon”) and its professional reading environment (PReE).”

REKn seems to be aiming to amalgamate and integrate knowledge in its area. Its implementation is based in the study of disciplinary activity of, and professional interaction among, those in the humanities. It’s founded in concepts of knowledge representation and modeling. A short description of the project can be found here.

Knowledge representation: draws on the field of AI and seeks to produce models of human understanding that are tractable to computation. Modeling: REKn/PReE model data, intellectual processes, and beyond.

Key elements of REKn’s model:
- representation of archival materials
- analysis/critical inquiry originating in those materials
- the communication of the results of these tasks (the dissemination of primary and secondary materials)

REKn’s assumptions: all of the above are interrelated and inseparable, and electronically representable.

They’ve collected primary and secondary sources, and have built tools for working with them (the tool-building process seems to have been multi-stage: many tools built and discarded as inadequate). They’re looking to long-term partnerships with Renaissance materials providers in the future. Right now REKn has about 13,000 primary sources and over 80,000 secondary sources. About 1500 of these resources are currently available for public use, but the majority are not open-access.

So that’s REKn, the text base. What about PReE, the reading environment? It’s a rudimentary document viewer, and analysis and communication facilitator. Currently the UI is a “down-and-dirty prototype,” as they’ve been concentrating on making things work in the back end. [vz: he’s showing PReE in Windows; I wonder what it’s written in.] They’ve made several analytical tools for the encoded texts. Primarily, though, analysis will be carried out using TAPoR tools.

Communication facilitated electronically: they’re attempting to provide a system by which people can manage their professional interaction.

Short-term goals:
- integrate better with TAPoR and the Public Knowledge Project reading tools
- conduct usability studies
- consult with “contextual” stakeholders, including acad. publishers
- move prototype to a web environment
- scale up!

And that’s Ray’s talk, the last talk of this conference. Next is the panel discussion, titled “Humanities computing science?”. The panel will consist of the five keynote speakers. I’m not sure whether I’ll be taking notes on this; it’ll be video recorded, and there’s little probability that I’d do it justice. Again, I’ll update when the webcasts are up.

Lyman on the electronic Piers Plowman

Posted on at 10:07 by vika. Categories: digital humanities.

Eugene Lyman is an Early English scholar at Boston University. The full title of his paper is “Presenting an electronic critical edition of Piers Plowman B.” The B refers to one set in a multitude of manuscripts of this Middle English alliterative poem, comprising only one of its versions.

EL frames his talk with the following two quotes:

“Computers compute, of course, but computers today, from most users’ points of view, are not so much engines of computation as venues for representation.” –Matthew Kirschenbaum

“Understanding the poetics and principles of electronic scholarly editing means understanding that the primary goal of this activity is not to dictate what can be seen but rather to open up ways of seeing.” –Martha Nell Smith

Lyman has created software that allows you to look at the existing manuscript pages from ten different manuscripts, enlarge portions of those pages, read the transcriptions, look at erasures that tell us interesting things about how people might’ve edited in medieval England. You can search within the dataset, visualize the text in various ways, view the underlying XML markup… yesterday (or was it the day before?) EL actually gave some of us an informal demo of this thing, and it is sweet.

How to make a continuity of presenting a single text that exists in multiple manuscripts?

EL created the Elwood Viewer, which looks at documentary editions of texts. Its aims:
- tight coordination of text and image
- visual cueing to guide/reinforce reader’s attention
- handy tools, all within a metaphorical arm’s reach
- ease of navigation, especially at opportune moments
- parsimonious use of screen real estate
- simple, no-cost programming environment, open to change [this is all JavaScript, I think… -vz]

The software allows you to extract data for further analysis. One such analysis called into question the notion that scribes were totally random with respect to the ways in which they encoded [marked up, oh yes, for markup exists in many forms including punctuation and embellished first letters] their texts.

This is all a critical edition: you take a group of witnesses, compare them, note the variations, and choose the variations you think were on the author’s agenda when she was writing the text. The notion of a critical edition, which effectively “leaves behind” the actual manuscripts, is pretty controversial; nevertheless, at the moment critical editions have a lot of weight in literary studies. [vz: I’m leaning toward the camp that’s sceptical of critical editions, but only in the sense that I don’t tend to consider them definitive, while many others do.]
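[vz: the "compare the witnesses, note the variations" step can be sketched in a few lines. This is not how EL's software does it, which I haven't seen inside; it's a naive word-by-word comparison that assumes every witness has the same number of words in the line. The sigla W1 to W3 and the variant spellings are made up, not transcribed from real manuscripts. Choosing among the variants stays the editor's job.]

```python
def collate(witnesses):
    """For each word position, list the readings and which witnesses carry them.

    Naive assumption: witnesses align word for word; extra words past the
    shortest witness are ignored.
    """
    tokenized = {siglum: line.split() for siglum, line in witnesses.items()}
    length = min(len(words) for words in tokenized.values())
    variants = []
    for position in range(length):
        readings = {}
        for siglum, words in tokenized.items():
            readings.setdefault(words[position], []).append(siglum)
        if len(readings) > 1:  # keep only positions where witnesses disagree
            variants.append((position, readings))
    return variants

# Hypothetical witnesses with invented spelling variants.
witnesses = {
    "W1": "in a somer seson whan softe was the sonne",
    "W2": "in a somer sesoun whan softe was the sonne",
    "W3": "in a somer seson whan soft was the sonne",
}
variants = collate(witnesses)
for position, readings in variants:
    print(position, readings)
```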

EL has a prototype version of the critical edition. I haven’t found any screenshots of it on the web, but here’s the Piers Plowman Electronic Archive hosted at the University of Virginia. It seems more a worksite than a ready-made resource, but is worth poking around.

Drucker on visualizing interpretation.

Posted on at 8:53 by vika. Categories: digital humanities.

Johanna Drucker is a professor of media studies, and a founding member of UVA’s Speculative Computing Lab. The full title of her keynote is “Graphic conventions: visualizing knowledge and subjectivity.”

Can we shift from information to interpretation, and shift [back] to a more humanistic point of view within digital humanities? To do that, we have to re-introduce subjectivity, and maybe substitute the mechanistic with the probabilistic (the latter being humanists’ worldview, according to JD).

Visual information conveys information in a form that makes it very hard to analyze systematically. Two ways to create stable knowledge in a notation system: one is with [natural] language, the other – mathematical notation, said Rene Toms of the Oulipo. He never did talk about visual representation, and with good reason: it’s an unstable notation mode.

Subjectivity comes in two forms: position (structural) or inflection (semantic).

The notion of information comes from a particular set of assumptions of what knowledge is. JD is not interested in getting rid of this model of knowledge, but rather in proposing another one.

Visualization is compact; the problem isn’t having enough space to represent information – it’s pinning down the exact nature of the information and our assumptions about its aspects.

Visual information (VI) can lie or misinform, like natural language can.

The silliness of chunking of processes (how authors write stories: “author thinks about a topic” –> “author sketches an outline” –> “author reviews the sketch”) is apparent, but we have to do this sort of chunking when we’re working in a computational environment, which requires discrete units. Because of schematics’ rhetorical power, we eventually come to believe them.

So what about text visualization (as opposed to data visualization above)? Interest in doing things to and with texts is very active, especially within the creative-writing communities. Like TextArc, that sort of thing. They can be silly, ugly, destructive of the original, and yet they have their uses.

Edward Tufte, the exquisite engineer according to JD: information pre-exists visualization. Visualizations can be transparent enough to get us access to information. JD disagrees: visualizations are interpretive, opaque, distortive. They create information.

Temporal modeling at SpecLab. Basic assumption: timelines as they are conventionally defined and designed come out of the empirical/natural sciences. Assumptions there: time is unilinear; time is homogeneous (metric is stable); time is continuous (no breaks in temporality). None of these three things hold. Temporality branches in our lives, in poetry/film/etc. Time is not homogeneous – some moments fly by, others are long (the moment before the kiss and the moment after are very different, JD says). Time is not continuous, either: there are breaks/ruptures, recorded in historical accounts for example.

SpecLab constructed a grammar of inflections, of visual elements they’d use to represent time, types of events and their relations to each other. There’s a lot of information JD is giving about what SpecLab has been doing; I’ll point you to the Lab’s site instead of summarizing.

In the IVANHOE game, every action takes place from within a role. Each role has a set of assumptions that go with it. They employed it as a teaching tool at UVA, with the purpose of showing that, in fact, every action stems from a set of presuppositions. [vz: that can’t be right, I’ve captured too simplistic a description. Go see the site for more.]

Subjective meteorology: JD’s current project. An art project, which JD says – duh, art is in the humanities. [vz: yay!] She charted and graphed and visually represented a bunch of weather patterns – lines of anxiety/anticipation, storms of anger – which look gorgeous on the slides but don’t seem to be on the web. These representations can be chained together and animated to playfully and visually represent one’s subjective perceptions of the world around.

Great discussion follows. I can’t pretend to capture it well enough; I’ll post an update when the keynote webcasts are up, and urge anyone interested to watch this one when it’s available.

Hoover on CaSTA and breadth.

Posted on 13 October 2006 at 16:06 by vika. Categories: digital humanities.

David Hoover is at NYU, and is the Vice-President of the Association for Computers and the Humanities. The full title of his paper is “CaSTAing breadth upon the waters.” (“Cast thy bread upon the waters: for thou shalt find it after many days,” says Ecclesiastes 11 – hence his title.)

DH seeks simple methods for examining word frequencies in corpora of single authors, at different stages of their production. Do authors tend to start disliking (using less) words they used to like (use a lot) earlier in their careers? Or vice versa? How does an author’s vocabulary change over her production’s life? DH does a lot of statistics to try to find out.

He’s talking about Trollope, about whom I know nothing. Apparently, the 100 most variable words in his corpus are all proper names. They also all appear in more than one stage of his writing career (early-middle-late). Henry James, however, has some non-proper nouns.
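[vz: to make "most variable words" concrete for myself, here's a rough sketch. DH didn't spell out his statistics, so I'm assuming the simplest measure I can think of: each word's share of all words in each career stage, ranked by the gap between its highest and lowest share. The function names and the three tiny "stages" are invented stand-ins, not Trollope or James.]

```python
from collections import Counter

def relative_frequencies(text):
    """Map each word to its share of all word tokens in the text."""
    words = text.lower().split()
    counts = Counter(words)
    return {word: count / len(words) for word, count in counts.items()}

def most_variable_words(stages, top_n=3):
    """Rank words by the spread (max minus min) of their share across stages."""
    per_stage = [relative_frequencies(text) for text in stages.values()]
    vocabulary = set().union(*per_stage)
    spread = {}
    for word in vocabulary:
        shares = [freqs.get(word, 0.0) for freqs in per_stage]
        spread[word] = max(shares) - min(shares)
    # Largest spread first; ties broken alphabetically for a stable order.
    return sorted(spread, key=lambda w: (-spread[w], w))[:top_n]

# Invented stand-in text for three career stages.
stages = {
    "early": "the coquette shook her tresses and the coquette laughed",
    "middle": "the coquette laughed and the banker sighed",
    "late": "the banker sighed and the banker frowned",
}
ranking = most_variable_words(stages, top_n=2)
```

With real novels you'd want proper tokenizing (punctuation, capitalization) and much bigger samples per stage, but the shape of the question is the same: which words rise or fall the most between early, middle and late.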

DH’s project is very much in progress; he says it’ll be a while until he has something interesting to say about the evolution of writerly language. One interesting question is: if you have a writer who starts writing very young, does their vocabulary change quickly, early on? What about writers who write far into old age?

One interesting consistency in James is that he seems to have used fewer “precious” nouns as the years went on: words like coquette and tresses.

Cunningham on the Arte of Navigation

Posted on at 15:34 by vika. Categories: digital humanities.

Richard Cunningham is at the Acadia University English department, and directs the hypermedia center there. The full title of his paper is “Developing digital navigation from The Arte of Navigation.”

Readers experience something different reading an electronic document as opposed to a paper one. [Glaringly obvious, RC admits.]

RC presents the Acadia Digital Culture Observatory. They have digitized a 1561 edition of The Arte of Navigation and want to observe how readers read and use it. The original text included a navigation instrument made of three concentric paper circles of different sizes (volvelles), which are to be overlaid one on top of another and rotated. Here, you can see it for yourself. (Hm, it doesn’t seem to work in Firefox on Mac in Blackletter mode; I suggest the use of Arial to be safe, or you can download Blackletter from their table of contents.) Check out particularly the navigation instrument Flash files in the “other moving images” section of the TOC; they’re fun to play with.

TAPoR on TAPoR.

Posted on at 14:23 by vika. Categories: digital humanities.

Ray Siemens of Victoria hosts a session of three papers related to the Text Analysis Portal for Research. First we have Geoffrey Rockwell, with “Text empires: text analysis in excess.” Shawn Day will talk about “The use of the recipe as a guiding metaphor for flexible and efficient self-guided computing instruction.” Finally, Stéfan Sinclair will talk “On data & views in text analysis.” All three presenters are from McMaster University in Hamilton, near Toronto.

ROCKWELL.

Information overload: 5 exabytes of information created in 2002. Exabyte = 1,000,000,000,000,000,000 Bytes. It’s a thousand petabytes, or a million terabytes. [Holy wow.] Spam is cheap, but reading has costs. How can text analysis help?

Why this explosion of information?
- growth in population and wealth: more money, more media toys
- multiple-media, from the photograph (1820s) to the iPod
- digitization of information and business practices: cheap creation, storage, reproduction, and transmission

Challenges to the system: what are the effects?
- experience of information overload
- multimedia shock
- narrowing expertise (because nobody can keep up with a broad discipline!)
- archive fever

What can we do?
- understand the problem (literary dimension to it; a problem of scale, a bibliographic problem)
- produce less? [shock! I can hear the internal gasps around the room!]
- file and (not) store smarter
- find smarter (not more) [ooh, I’ll quote him in my dissertation work! no, I cannot, in fact, process all the litcrit written to this day]
- learn to read differently

The latter two of the above are opportunities for text analysis.

Problem of scale to text analysis for finding and reading:
- heterogeneous formats and multimedia rich
- closed (“for perfectly reasonable reasons” -GR) information empires (Google) build on existing indexes or build their own
- new questions, research methods (data mining and visualization)
- text analysis tools developed for coherent texts (collaborate with data mining & HPC [high-performance computing] community)

TAPoR.2 model, Beyond Finding and Reading:
- gathering and aggregation function (working with existing empires like Google; create your own study library (myEmpire))
- mining function (clustering and classification; provoking questions, not finding)
- interface and visualization function (effective interactions for research)

DAY

They’re using the recipe metaphor to get people of different backgrounds to use TAPoR.

A recipe for self-guided instruction:
- ingredients
- steps
- glossary
- discussion
- further information

Ingredients:
- ingenuity
- a useful metaphor
- a versatile set of tools
- users desirous or willing to consider using said tools

Steps
- identify objective
- consider users’ needs
- develop case studies that describe how your tools can meet these needs
- apply a familiar metaphorical approach to engage and instruct
- deploy recipes through a wiki

Glossary
- recipe: a useful guiding metaphor that offers optimal flexibility…. [couldn’t get it, too fast]

Further Information:
Try the recipes out! (For example.)

Nice, familiar, easy concept. As Shawn is pointing out right now, super easy to engage a beginner user. This could be very useful, as well, when getting folks used to traditional humanities research methods to try, say, text encoding.

SINCLAIR

[Stéfan is the creator of HyperPo, the coolest text analysis tool ever so far.]

Generally, there’s a one-to-one mapping between tools and the data views of their results. SS has been thinking more in terms of this progression:

text -> tool -> data (TAML) -> style -> view

Among other things, he wanted to create a framework to use in teaching the development of text analysis tools in a modular way.

It’d also be nice to be able to chain tools together – you run a tool on a text, get the resultant data and feed it to another tool, and so on. This requires tools that can “talk” to each other, and output data in the same (or similar enough, or easily translatable) formats.
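Here is roughly how I picture that chaining, as a small Python sketch. None of this is HyperPo or TAML; the three "tools" and the `chain` helper are hypothetical names, and the shared format is just plain Python lists standing in for whatever the real interchange format would be.

```python
from collections import Counter

def tokenize(text):
    """Tool 1: text -> list of lowercase word tokens."""
    return text.lower().split()

def count_tokens(tokens):
    """Tool 2: token list -> (word, count) records, most frequent first, ties alphabetical."""
    return sorted(Counter(tokens).items(), key=lambda item: (-item[1], item[0]))

def as_table(records):
    """View: (word, count) records -> tab-separated plain-text table."""
    return "\n".join(f"{word}\t{count}" for word, count in records)

def chain(data, *tools):
    """Feed the output of each tool into the next one."""
    for tool in tools:
        data = tool(data)
    return data

view = chain("the cat saw the other cat", tokenize, count_tokens, as_table)
print(view)
```

The point of the design is that each tool only has to agree with its neighbors on the shape of the data, so the same counts could be handed to a different view without touching the tokenizer.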

HyperPo 7.0 is coming soon!

Arms on vast amounts of data.

Posted on at 13:16 by vika. Categories: digital humanities.

William Y. Arms is a computer scientist currently working at Cornell. The full title of his keynote is “Humanities and social science research using vast amounts of web data.”

Examples of very large collections:
- Library of Congress: National Digital Information Infrastructure and Preservation Program
- The Internet Archive’s historical collection of the web (600 TB, terabytes)
- Large scale digitization projects: Open Content Alliance, Project Gutenberg, Google, Microsoft, Yahoo, etc.
- USC Shoah Foundation: Survivors of the Shoah (400 TB)

How will humanities and social science scholars do research on collections which are large by supercomputing standards?

“Only the computer reads every word” –Greg Crane
- Researchers interact with the collections through computer programs that act as their agents.
- Users rarely view individual items except after preliminary screening by programs.
- Collection requires a highly technical computer system that is used by researchers who are not computing specialists.
- The collection is a high-performance computing system.
- Use of the collection depends on automated tools, which require state-of-the-art indexes for text and semi-structured data, natural language processing, and machine learning.

Example: the Cornell Web Lab (or is it a Library, asks Arms?)

The structure of text:
Manual analysis and mark-up
- skilled bibliographers and cataloguers
- manual textual markup
- semantic web tools for representing relationships (e.g., RDF, Fedora)
Semi-automated methods
- automated name recognition under human control (e.g., Perseus)
- expert-guided web crawling (e.g., iVia)

The above are tens of millions of records. How do we manage billions of records?

Example: The Internet Archive web collection
The data: complete crawls of the web, every two months since 1996, with some gaps:
- range of formats and depth of crawl have increased with time
- no data from sites that are protected by robots.txt or where owners have requested not to be archived
- some missing or lost data
- metadata contains format, links, anchor text
- organized to facilitate historical access to a known URL (Wayback Machine)

The research dialog between a scholar (S) and a computer scientist (CS) goes something like this:
S: Here’s a study we’d like to do…
CS: We don’t know how to do that analysis, but would this be any use to you?
S: Not as you suggest it, but here’s another idea…
CS: That might be possible, with the following modification…
BOTH: Let’s try it and see!

Eventually we get something that is both useful from a research point of view and feasible from a computing POV.

Social Science Research:
- the web as evidence of current social events (spread of urban legends; development of legal concepts across time)
- the web as social phenomenon (political campaigns, online retailing, polarization of opinions)

Research topic example: social and information networks, joining a community. Question: what is the probability an individual will adopt a new behavior, as a function of the number of his/her friends who are adopters? New behavior could be: adopting a new technology, joining a club, etc.
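To spell out what that question asks of the data, here is a minimal Python sketch that estimates the adoption probability for each count of adopter friends. The observations are invented, and the function name is mine; WA did not show how the Web Lab computes this.

```python
from collections import defaultdict

def adoption_curve(observations):
    """observations: iterable of (number_of_adopter_friends, adopted) pairs.
    Returns {k: fraction of individuals with k adopter friends who adopted}."""
    totals = defaultdict(int)
    adopters = defaultdict(int)
    for k, adopted in observations:
        totals[k] += 1
        if adopted:
            adopters[k] += 1
    return {k: adopters[k] / totals[k] for k in sorted(totals)}

# Made-up observations: (friends who already adopted, did this person adopt?)
sample = [(0, False), (0, False), (1, False), (1, True), (2, True), (2, True)]
curve = adoption_curve(sample)
print(curve)
```

On web-scale data the hard part is getting the (friends, adopted) pairs in the first place, not this last counting step.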

So, when everything is in digital form, will the library go from being the largest building on campus to being the largest computing system on campus? WA says there’s a good likelihood of that.

WA goes on to describe some of the projects on the Web Lab’s plate right now. Their descriptions can be found on the Web Lab site.

Policy issues on the use of the lab: custodianship of data; copyright; privacy.

Design guidelines for builders of large digital collections:
- every online collection or service needs an application program interface (API) for computers, not humans, to interact with the library.
- a primary methodology is: select a subset of the collection; download to researcher’s computer; use programs on the researcher’s computer to analyze the data.
- almost all metadata will be computer generated, but human cooperative editing can correct errors.

Pytlik Zillig on TokenX

Posted on at 11:17 by vika. Categories: digital humanities.

Brian Pytlik Zillig is an all-around digital-library tech wizard at the University of Nebraska-Lincoln (UNL), which hosted the first annual Digital Humanities Workshop a few weeks ago. The full title of his paper is “TokenX: a text visualization, analysis, and play tool designed for the XML document tree.”

Some history:
- CDRH and other digital centers significantly rely upon XML
- XML is a 1998 [whoa, old] recommendation [hunh, not a standard] of the W3C
- XML is a robust and flexible medium for content
- UNL has been using XML/SGML since 1998
- all CDRH projects use XML

Research question, born in 2004:
- can emerging standards assist in text visualization, analysis and play? (for example: XSLT)

BPZ’s goal:
- use XSLT to explore text visualization, analysis, and play (TVAP)
- provide TVAP options useful to facilitate the creative, qualitative, and quantitative exploration of XML text

Why another text analysis tool?
- there are good tools available written in a variety of languages, but none are created in XSLT, and none that takes advantage of the special relationship between XML and XSLT

Say we take a line from a Shakespearean sonnet: “when to the sessions of sweet silent thought.” You’d be crazy to try to mark up every word in XML by hand; it’d be a huge undertaking. But XSLT 2.0 can add markup to words using tokenization! Way cool! Tokenized, each word will look like this: <w>word</w> – and here’s a punctuation mark: <nonWord>,</nonWord>

With this markup, XSLT can be used to do a variety of TVAP actions on a text. TokenX ingests XML documents, retaining the original markup, and adds tokens like the examples above. Visualizations include word highlighting, keywords in context (looks a bit like a concordance), replacing words with blocks (for example, to find words that are too long?), highlighting punctuation and non-words, all kinds of stuff.
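For the curious, here is the tokenizing idea as a minimal Python sketch. TokenX itself does this in XSLT 2.0 and keeps the document's existing markup; this sketch only handles a plain string, and the regular expression and function name are my own assumptions, not TokenX's rules.

```python
import re
from xml.sax.saxutils import escape

# A word is a run of word characters (optionally with internal apostrophes);
# any other non-space character counts as a non-word token.
TOKEN = re.compile(r"(?P<w>\w+(?:['’]\w+)*)|(?P<nonWord>[^\w\s])")

def tokenize_to_xml(line):
    """Wrap each word in <w> and each punctuation mark in <nonWord>."""
    parts = []
    for match in TOKEN.finditer(line):
        tag = match.lastgroup  # name of the group that matched: "w" or "nonWord"
        parts.append(f"<{tag}>{escape(match.group())}</{tag}>")
    return " ".join(parts)

print(tokenize_to_xml("when to the sessions of sweet silent thought,"))
```

Once every word sits in its own element, things like highlighting or counting become simple selections over `<w>` elements.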

Here’s the TokenX site, if you’d like to play with it.

Analyze, in the TokenX context, means:
- count words in context
- decontextualize words and count them (frex, list all the words in the document alphabetically, or by frequency, each word only once with a number of its occurrences next to it)
- word statistics (how many words, how many elements containing words, mean number of words per element)
- punctuation and non-word statistics

TokenX exports into spreadsheets, so you can export and save your dataset.

You can play with TokenX:
- substitute words
- replace words with images

Best part: it’s free and open-source. “You can change it!” Brian exclaims. Excellent.

Wulfman on the Modernist Journals Project

Posted on at 10:46 by vika. Categories: digital humanities.

Cliff Wulfman is working at Brown – lucky us! (Major shout-out to Cliff.) The full title of his paper is “The Modernist Journals Project: A new architecture.”

Here’s the MJP site. It’s evolved from a quite small-scale faculty project. Cliff talks about how to take one of those and move it toward technologies that will allow it to grow and expand and move at a healthy pace. Their primary-source set is pretty large: modernism grew up largely in periodicals, and they’re digitizing them and putting them online.

Complete runs of magazines are scarce, CW says. Even when they exist, oftentimes the advertising has been stripped.

The MJP started out with a desktop scanner and an OCR (optical-character-recognition) package. One office, one faculty member, several students. That’s all. They tried to digitize all 30 volumes – nearly 18,000 pages! – of The New Age (“a weekly review of politics, literature, and art”) that way.

The limits of the original implementation: it was labor-intensive, hand-scanned and hand-coded; and it was served through eclectic, hand-made HTML pages. The MJP was outgrowing the pot in which it was seeded, CW says. It needed:

- engagement with the concept of “cyberinfrastructure”
- embrace of new technologies, standards, best practices that weren’t in place when the project was first conceived.

So they stepped back and devised a new architecture:
- complex digital objects based on digital library standards (METS, MODS, MADS)
- XML substrate
- data- [?] and database- driven service
- polymorphous delivery: can deliver in formats other than PDF

We then had a demo. Go look at the site for more. :)

Future directions:
- access to new scanner technologies will enable vast collection growth
- developing an interlinked encyclopedia of modernism
- build on Fedora’s digital library infrastructure

Hirtle on TRANSLATOR.

Posted on at 10:25 by vika. Categories: digital humanities.

David Hirtle is doing graduate work here at the University of New Brunswick, in computer science. The full title of his paper is “TRANSLATOR: a TRANSlator from LAnguage TO Rules.”

Semantic web is still not widely used.
- Focus of current development: machine-readable (meta)data
- Problem: only experts can contribute. Need to lower barrier to entry.

Provide a user-friendly format!
- why not English [he really means natural language]?
- “controlled English” avoids ambiguity: it’s formal, but also natural

TRANSLATOR will translate “every student gets a discount of 15 percent” to express [in XML, from what I see] that “student” implies “customer,” etc.

ACE (Attempto Controlled English):
- looks like English: “every honest student who does not procrastinate receives a good mark and easily passes the course.”
- but actually a formal language, like RDF: a tractable subset of English – all ACE sentences are English, but not vice versa
- every ACE sentence can be unambiguously translated into logic.

Strategies for handling ambiguity:
- exclude imprecise phrasings (“students hate annoying professors” – do they hate to annoy profs, or do they hate profs who are annoying?)
- interpretation rules (“the student brings a friend who is an alumnus and receives a discount” – who receives the discount? In ACE, by default, it’s the student because of a certain rule. If you want it to be the alumnus, you write “…and who receives a discount.”)

How can rules be expressed?
- in natural language, many different forms (everyone is mortal, all humanity is mortal, for each person the person is mortal)
- all above are valid ACE
- further embellishment (negation, relative clauses, etc) [vz: but doesn’t that add ambiguity?]

What can’t yet be easily expressed?
- “infix” implication (“the student is happy if there is no class” – solution: TRANSLATOR swaps the condition(s) and conclusion(s) and voila, ACE-acceptable)
- production and reaction rules (involve actions: “if a student is caught cheating then send a report to the registrar” requires the imperative mood, which is not yet in ACE)
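
The infix swap above is easy to picture with a toy Python sketch. This is only string shuffling under the assumption that the sentence has a single “ if ” in the middle; TRANSLATOR surely works on a parsed representation, and the function name is hypothetical.

```python
def swap_infix_implication(sentence):
    """Rewrite 'CONCLUSION if CONDITION' as 'if CONDITION then CONCLUSION'."""
    body = sentence.strip().rstrip(".")
    conclusion, separator, condition = body.partition(" if ")
    if not separator:
        return sentence  # no infix "if" found; leave the sentence untouched
    return f"if {condition} then {conclusion}"

print(swap_infix_implication("the student is happy if there is no class"))
```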

Discourse representation structures, and more technical info. Sad, I can’t reproduce his diagrams here. The rules are eventually translated into RuleML, in whose development David is participating.

RuleML:
- goal is interoperable rule markup (XSLT translators to other semantic web languages)
- family of “sublanguages” (modular XML schemas; each represents a well-known rule system; TRANSLATOR uses First-Order Logic sublanguage)

Why use RuleML?
- ease of interchange (XML)
- compatibility with RDF and other languages, as well as W3C’s upcoming Rule Interchange Format
- availability of tools
- wide variety of features (negation-as-failure, weightings, data types etc.)

Again, work-in-progress. Truly an attempt at getting closer to the semantic web. Formalizing natural language, what a gargantuan task. One critical benefit of TRANSLATOR is that it “allows non-experts to write facts and rules for the semantic web.” When can we play with it?

Now, it appears. Here’s a site for TRANSLATOR, including a Java Web Start demo.

Lukon and Juola on building an index generator.

Posted on at 10:00 by vika. Categories: digital humanities.

Shelly Lukon and Patrick Juola are both at Duquesne University. The full title of their paper (presented by Lukon) is “Designing a context-sensitive machine-aided index generator.”

Problem definition.

Back-of-the-book indexing provides relevant terms, identifies cross-references and subcategories, and has a static, rigid structure (as opposed to web indexing). Human indexers invest a LOT of time into indexing (1 week per 100 pages of text); use software to automate mundane tasks; and make all the intelligent indexing decisions. SL&PJ’s prototype system is meant to bridge the gap between the human and currently available tools, not to replace the human indexers.

They’ve interviewed professional indexers, product-tested some of the software packages they tend to use, and looked at some mathematical techniques (particularly LSA, latent semantic analysis) that have had proven success in text processing and capturing semantic content of terms.

Cognitive tasks involved in index construction:
- identifying terms to index;
- locate all informative references;
- identify/locate synonymous terms;
- split index terms into subterms;
- develop cross-references within text;
- compile page numbers.

Their techniques for obtaining semantic information:
- parsing/tagging of terms, frequency analysis
- LSA
- word sense disambiguation (WSD)
- hierarchical cluster analysis (HCA)

This is still a work in progress. So far they’ve been able to locate all informative terms in text, and to allow the user to set thresholds/parameters. LSA, WSD and HCA show first level of clustering nearly 40% accurate upon inspection (not great, but a solid start). Their single-processor PC takes several hours to process small (60K words) corpora. Better than a human’s speed!

They’re categorizing words into parts of speech: identify the part-of-speech of each term; label each term with delimiter and acronym (home becomes home/NN since home is a noun). They’re only dealing with English right now. Their app is written in Java, as is MontyLingua which they’re using for part-of-speech tagging.

LSA:
- use factor analysis to generate numerical representations of terms and their meanings;
- divide corpus into “documents” (paragraphs), then analyze each unique “term” (word) relative to each document;
- create term-by-document matrix;
- create term-by-term covariance matrix (look at how each pair of terms vary together)
- singular value decomposition (SVD) - a way of explaining variability among random variables (dimensions)
- decompose covariance matrix into three submatrices [over my head here]
- rank resulting values
- reconstruct using most significant dimensions (reduce noise, sharpen similarities/contrasts)
- 200 most significant dimensions: pinpoint each term’s location in 200-dimension “semantic space” [why 200?]
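
Since the SVD step went over my head, here is a minimal sketch of the pipeline as I understand it from the list above, using NumPy (my assumption; their app is written in Java). The toy documents, the function name, and the choice of 2 dimensions instead of 200 are all mine, and scaling the coordinates by the square root of the singular values is one common convention, not necessarily theirs.

```python
import numpy as np

def lsa_term_vectors(documents, dims):
    """Place each term in a `dims`-dimensional space via SVD of the term covariance matrix."""
    tokenized = [doc.lower().split() for doc in documents]
    vocab = sorted({word for doc in tokenized for word in doc})
    row = {word: i for i, word in enumerate(vocab)}

    # Term-by-document matrix: one row per term, one column per document.
    counts = np.zeros((len(vocab), len(tokenized)))
    for col, doc in enumerate(tokenized):
        for word in doc:
            counts[row[word], col] += 1

    # Term-by-term covariance: how each pair of terms varies together across documents.
    covariance = np.cov(counts)

    # SVD returns three matrices; singular values come back sorted largest first.
    u, singular_values, _vt = np.linalg.svd(covariance)

    # Keep only the most significant dimensions as each term's coordinates.
    return vocab, u[:, :dims] * np.sqrt(singular_values[:dims])

# Toy "documents" (paragraphs), invented for illustration.
docs = ["bass guitar music", "guitar music band", "bass fish boat", "fish boat lake"]
vocab, vectors = lsa_term_vectors(docs, dims=2)
print(dict(zip(vocab, vectors.round(2).tolist())))
```

Terms that occur in exactly the same documents (here “guitar” and “music”) land on the same point, which is the sharpening of similarities the last bullets describe.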

WSD
- separate out different senses (meanings) of each term token
- numerical encodings generated by LSA give average context for each term token
- look at encodings of the other terms surrounding each occurrence of the token
- Example: the word “bass” occurs throughout text (both as fish and as musical instrument), proximate to other words (guitar, boat, fish) that help disambiguate
- disambiguate “bass” into “bass_fish” and “bass_instrument”

HCA
- partition terms into subsets with similar properties/characteristics
- antonyms as well as synonyms will cluster together (both have strong relationships, but the system doesn’t know whether they’re positive or negative)
- this information can be used to identify cross-refs (see also) and subterms
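
As a rough picture of the clustering step, here is a small Python sketch that merges the closest terms first and stops once the remaining groups are too far apart. The talk did not say which distance or linkage they use; Euclidean distance, single linkage, the threshold, and the hand-made 2-D vectors are all assumptions for illustration.

```python
import math

def cluster(vectors, threshold):
    """Single-linkage agglomerative clustering: repeatedly merge the two closest
    clusters until the closest pair is farther apart than `threshold`."""
    clusters = [[name] for name in vectors]

    def distance(a, b):
        # Single linkage: distance between the closest members of two clusters.
        return min(math.dist(vectors[x], vectors[y]) for x in a for y in b)

    while len(clusters) > 1:
        pairs = [(distance(a, b), i, j)
                 for i, a in enumerate(clusters)
                 for j, b in enumerate(clusters) if i < j]
        closest, i, j = min(pairs)
        if closest > threshold:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]

# Hand-made positions in a 2-D "semantic space" (not real LSA output).
toy_vectors = {
    "guitar": (1.0, 0.1),
    "bass_instrument": (0.9, 0.2),
    "boat": (0.1, 1.0),
    "bass_fish": (0.2, 0.9),
}
groups = cluster(toy_vectors, threshold=0.5)
print(groups)
```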

This is a machine-aided system. Its purpose is not to replace but to assist the human indexer, whose judgment and experience cannot be fully captured by a sophisticated expert system. Users can edit results at any stage, control indexing parameters, etc.

Metrics for evaluating the “goodness” of the resulting index:
- side-by-side comparison between entirely-human-generated and machine-aided indexes of the same dataset, quantify what percentage of agreement is acceptable, maybe find meaningful information in how they disagree as well.
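
One simple way to put a number on that comparison is sketched below in Python. The agreement score here is shared terms divided by all terms in either index (Jaccard overlap), which is my choice of measure, not something SL named; the two toy indexes are invented.

```python
def index_agreement(human_terms, machine_terms):
    """Compare two indexes by their sets of terms."""
    human, machine = set(human_terms), set(machine_terms)
    shared = human & machine
    everything = human | machine
    return {
        "shared": sorted(shared),
        "human_only": sorted(human - machine),
        "machine_only": sorted(machine - human),
        # Share of all indexed terms that both indexes agree on.
        "agreement": len(shared) / len(everything) if everything else 1.0,
    }

report = index_agreement(["bass", "guitar", "lake"], ["bass", "guitar", "boat"])
print(report)
```

The `human_only` and `machine_only` lists are where the “meaningful information in how they disagree” would show up.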

Future work:
- incremental refinement
- system has modular architecture for ease of swapping out individual components
- need robust, effective user interface
- empirically vary frequency thresholds, weighting methods, number/percentage of dimensions to use in the reduced data matrix
- continue to build in the latest/most efficient indexing/retrieval methods.

What a great project. I’d love to use it for RolandHT, but it probably won’t be done in time. Enabling the software to read/process XML is on their wish list of big enhancements, hooray!