Category: Open Data
August 10th, 2009
Moving Data.gov towards the Semantic Web
Government transparency in all its forms would appear to be very much in vogue at present, spanning everything from the Obama administration’s Data.gov portal and Prime Ministerial pronouncements in the UK Parliament to municipal proclamations of openness in Vancouver and compelling grass-roots demonstrations by activists and even newspapers.
At the heart of many of today’s initiatives lie programmes to surface Government data for use and re-use by third parties. The ‘open’ in ‘Open Data’ is, of course, a very loaded term, and I’ve looked before at some of the ways in which data might become ‘open’ whilst remaining effectively useless. Nevertheless, Governments’ current enthusiasm for being seen to embrace transparency should certainly be both welcomed and encouraged, and there are real opportunities to work with Government in ensuring that today’s transparency fervour is not diminished, whether by omission or commission.
Given the complex and varied nature of the data involved, and the obvious linkages between the entities (you and I, our communities, our schools, our hospitals) described in numerous different databases, there’s a clear opportunity for technologies and approaches from the Semantic Web community to play a significant role in simplifying the whole process of moving these legacy databases online.
Already interested in Open Government from previous roles, and (obviously!) committed to encouraging real-world adoption of semantic technologies, I’ve spent some time recently talking to a number of those involved. A number of those conversations are now available as podcasts, and I’ll continue to seek out fresh examples and perspectives to share.
My most recent podcast conversation, released today, is with Professor Jim Hendler and Dr Li Ding of the Tetherless World Constellation at Rensselaer Polytechnic Institute in Troy, NY. The team at Rensselaer have been working with some of the US Federal Government’s data sets on Data.gov, and so far they’ve converted sixteen data sets from their original form, resulting in 2,927,398,352 freely available RDF triples and a number of demonstration applications.
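To give a rough feel for what ‘converting a data set into RDF triples’ can involve, here is a minimal sketch in Python. It is not the Rensselaer team’s method, which the podcast covers in their own words; the base URI, the sample rows and the function names are all invented for illustration, and a real conversion would need agreed vocabularies and typed values rather than plain string literals.

```python
import csv
import io
from urllib.parse import quote

# Hypothetical base URI and sample rows, made up for this sketch.
# Neither comes from Data.gov or from the Rensselaer conversions.
BASE = "http://example.org/dataset/"

SAMPLE_CSV = """id,name,state
1,Troy,NY
2,Vancouver,BC
"""


def escape_literal(value):
    """Escape backslashes, quotes and line breaks for an N-Triples literal."""
    return (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def csv_to_ntriples(text, key_column):
    """Turn each non-empty cell into one triple: row URI, column URI, cell value."""
    triples = []
    for row in csv.DictReader(io.StringIO(text)):
        # The key column identifies the row, so it becomes part of the subject URI.
        subject = f"<{BASE}record/{quote(row[key_column], safe='')}>"
        for column, value in row.items():
            if column == key_column or value == "":
                continue
            predicate = f"<{BASE}property/{quote(column, safe='')}>"
            triples.append(f'{subject} {predicate} "{escape_literal(value)}" .')
    return triples


if __name__ == "__main__":
    for line in csv_to_ntriples(SAMPLE_CSV, "id"):
        print(line)
```

Every cell becomes one statement, which is why even modest tables add up to large triple counts very quickly.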
Other conversations already released in the series include;
- David Eaves, talking about Vancouver’s commitment to Open Data
- John Sheridan, Head of e-Services at the UK Government’s Office of Public Sector Information, talking about his Department’s efforts to get Government data online
- Mark Birbeck, talking about work with the UK Government’s Central Office of Information to embed lightweight RDFa into workflows and web pages
Each offers an example of ways in which ‘open data’ contributes to Government transparency, or to increasing the value of the massive sunk investment in collecting, managing and curating the data upon which Governments depend. The Semantic Web’s notion of Linked Data (whether actually in RDF or not!) offers a means to increase the utility of the data we have, without a massive programme of reengineering the systems used to manage it. The examples we see today, and the work of the individuals and teams with whom I have been speaking, will teach us a lot about how to make this work at Government scale.
June 18th, 2009
New York Times embraces Linked Data
The keynote on this final day of the Semantic Technology Conference saw Robert Larson and Evan Sandhaus of the New York Times talk about the paper’s innovative adoption of semantic technologies;
“The first semantic search system for The New York Times was released in 1913 and was available bound in either paper ($6) or cloth ($8). In the 96 years since the advent of The Historical Index to The New York Times, semantic technology has become central to The New York Times’ daily operations and the focus of much internal research and development. In our keynote, Rob Larson, VP of Digital Production, and Evan Sandhaus, Semantic Technologist, will review the long history of semantic technology at The New York Times; discuss the application of this technology in our operations; and review an innovative initiative to enlist the global community in solving some of our toughest challenges.”
Sandhaus and Larson begin by referring back to the Times’ long history, and the early importance of the paper’s emphasis on building - and selling - a comprehensive abstracting and indexing service to stories in the paper. This, they suggest, was important in leading to the paper being considered as the paper of record, ahead of its numerous competitors.
Building upon the paper’s nine-month-old ‘Annotated Corpus’ and its associated APIs, Larson closed the session by announcing that the Times’ thesaurus is to be made available using a license and APIs that will see it available to play a part in the wider Linked Data cloud.
June 16th, 2009
Semantic Technology Conference kicks off with Keynotes from Open Calais and Siri
This year’s Semantic Technology Conference got fully underway this morning, with Keynote presentations from Tom Tague of Thomson Reuters’ Open Calais Initiative and Tom Gruber from Siri.
Despite the wider economic situation, attendance for this fifth year of the event feels a little up on last year, and there’s clearly real enthusiasm in the buzzing Halls.
Tague’s Open Calais has been one of the success stories for useful and easy application of semantic technologies beyond a core community of enthusiasts and adopters, and has been covered here and on Cloud of Data a number of times since it launched. Just today, they announced a new set of partners and a postal service that should remove one more perceived barrier for another set of potential adopters.
Speaking to the theme of ‘Web 3.0 - the Web of Me,’ Tague’s abstract suggests;
“The mainstream adoption of Web 2.0 technologies – from RSS feeds to social networks – is hastening the demise of the portal. With each new face on Facebook, and each new Twitter account, our once routine habits and traffic patterns shift. This wave of change in the way we consume, transact and interact on the Web is dis-intermediating ‘destination’ sites of all kinds. Our once centralized content has been atomized.
And yet our fundamental problem persists. We’re overwhelmed with input, yet still can’t find the one thing we need… now.
Semantic technologies – and the content interoperability and Linked Data connections they beget – offer new hope. That is not to say the answer lies in building new search engines, and few would argue for another news aggregator. Rather, our point of inflection lies at the point of consumption. Our task is to simultaneously refine and enrich our digital experience of everything from content and community to commerce.”
Early on, Tague made a ‘non-apologetic statement;’
“People need to start deriving financial benefits from semantic technology. It’s time”
Absolutely!
Tague looks back at the move from ‘Web 1.0,’ described as ‘the last Web we agreed on,’ to ‘Web 2.0,’ which he sees as largely defined by the ‘addition of social.’ Today, he reckons, we are ‘extraordinarily content-rich’, ‘extraordinarily information-poor’ and ‘experientially deficient.’ Despite a wealth of content, we are failing to make the most of it.
‘We’re at the inflection point’ where ‘innovation is exploding’ as we move from developing and inventing toward mainstream adoption of technologies in the semantic technology space. Lots of things will be tried; 90% will fail, but that’s ok.
‘Everyone needs plumbing,’ and that’s what Calais is; semantic plumbing. 13 version releases in 18 months; about 100 presentations, 13,000 registered Open Calais developers, a million great ideas.
Tague reckons the various efforts he comes in contact with fall into six broad buckets;
Tools; Social; Advertising; Search; Publishing; Interface.
First, Enabling Tools. Data Management, Data generation, Databases, Integration and workflow. ‘A big yes.’ ‘We need tools.’ Everyone needs tools, especially as you move from early adopters toward the mainstream. Tools build the bridges that cross the chasm to enterprise adoption.
Enterprise adoption will not happen because it’s cool. Enterprise adoption will not be talked about on Twitter. Enterprise adoption will happen because it’s cheaper/faster/better than what they have just now.
‘Tool vendors need to simplify their story; it’s not about more functionality.’ ‘If I can’t understand your story, then Enterprise IT certainly can’t’
Second, ‘let’s put some frosting on top of social.’ ‘Wouldn’t it be cool if we could…’ Some of it might be cool, but there’s a challenge in monetising social. Adding frosting to the top of an industry that hasn’t worked out its own monetisation is fraught with risk.
‘I haven’t seen a compelling story yet.’
Next, advertising. Almost a dirty word in the semantic technology domain last year. But advertising is fuel, and semantic technologies have a clear role to play in enhancing advertising (see my podcast with Scott Brinker from last year…).
Semantic search; ‘the semantic industry’s brilliant yet under-achieving child.’ The answer to a question no one is asking? General, consumer-facing semantic search… directly competing with Google et al? Not viable.
But vertical search in specific domains… a huge growth opportunity, and people are willing to invest the time, effort and money to make it happen. Room for a handful of players in each domain?
Search; ‘a bifurcated marketplace.’
Publishing; content producers, editorial/aggregation, ‘robotic publishing.’
‘Classic publishers can get enormous value from this technology… not all of the value is in the user experience.’ Much of the value is being found in the back office, making existing data and investments work harder.
Little value in ‘robotic publishing,’ because the content isn’t that readable. Aggregation services like Huffington Post and Daily Me present ‘enormous opportunities.’
Interface; gaming a huge and growing market. $57bn industry. A ‘seamless, interactive and responsive experience,’ it’s ‘graphically engaging and fun.’
Zemanta, AdaptiveBlue, Feedly, Apture et al ‘trying to make the consumption experience different’ [better?]. Not suggesting that these are like a game, but many of the drivers may be similar?
“People are on their mobile devices and in the browser; go where the people are.” Which links well to the next keynote…
“Do you care about semantics or about user value?”
“Don’t fund/buy semantic infrastructure beyond what you need; use infrastructure built by others where possible.”
“Think very hard about the user experience; make it compelling and exciting.”
Following Tague’s presentation, Tom Gruber took to the stage to talk about Siri; a company building a Virtual Personal Assistant (with an interesting iPhone app to start things off) that we discussed during a podcast last week. As Gruber’s abstract says;
“We are beginning to see a new interaction paradigm for the web: the Virtual Personal Assistant (VPA). A VPA is task focused: it helps you get things done. You interact with it in natural language, in a conversation. It gets to know you, acts on your behalf, and gets better with time. The VPA paradigm builds on the information and services of the web, with new technical challenges of semantic intent understanding, context awareness, service delegation, and mass personalization.
Siri is a virtual personal assistant for the mobile Internet. Although just in its infancy, Siri can help with some common tasks that human assistants do, such as booking a restaurant, getting tickets to a show, and inviting a friend. We will describe the technology underlying Siri and how it fits in the larger ecosystem of services and data providers. And we will offer a vision of where assistants like Siri are going.”
Tom starts off by showing the Knowledge Navigator video from Apple… which dates all the way back to 1987. Many of the ideas are now coming to fruition; touch screens, a global network, awareness of temporal and social context, speech in and out, a ‘conversational interface,’ ‘delegation of work’ to the machine, and trusted use of personal data.
Is the Knowledge Navigator possible today? ‘No, but we’re getting there.’
Siri is pretty close… in certain well understood contexts, as Gruber shows in a video demo of the evolving iPhone application.
What is a Virtual Personal Assistant? It does things for you; it’s task-oriented. It understands your intent via a conversational metaphor. It gets to know you; it’s not the same for everybody, unlike a search engine.
‘Service delegation [like Siri]; the mother of all mashups’
‘Context is king’ in communicating with a VPA; where am I, what time is it, who am I, etc.
“This really is the beginning of the age of the start of Virtual Assistants.”
Need to solve authorisation/authentication. If we reach a ‘data commons’ there will be more, better, information to drive choices and decisions.
Tom Tague is a regular member of the Semantic Web Gang podcast, which I moderate. Tom Gruber was the latest guest in my Executive Briefing podcast series.
April 15th, 2009
Leigh Dodds talks about Talis Connected Commons
I wrote about Talis’ Connected Commons last month, and today spent some time talking with the company’s Platform Programme Manager, Leigh Dodds.
The conversation has just been released as a podcast which looks at the rationale behind the company’s offer and the specific licensing choices that beneficiaries are asked to make.
Have a listen, and see if the Connected Commons might help your next project.
Disclaimer: Talis is my former employer
March 30th, 2009
Growing the Linked Data pool, with the Talis Connected Commons
Back in December of 2008, I wrote about a new initiative from Amazon to make large sets of public data more accessible. Amazon offered to mount the data for free, and for developers writing applications elsewhere in the Amazon Web Services ecosystem even the bandwidth cost of communicating with GenBank, the PubChem Library, the US Census and similar resources was zero. As I wrote at the time,
“By offering free hosting for public data, then, Amazon are doing the wider community a huge service. Much of the data there today is reasonably readily available from other sources, so the biggest immediate benefits are those of speed and cost… For existing or potential users of Amazon’s Web Services to power their applications, this is yet another reason to consider Amazon.”
Over the weekend, Talis made a similar offer to host public domain data (licensed under Creative Commons’ CC0 or Open Data Commons’ PDDL).
What’s interesting about the company’s ‘Connected Commons’ is that the data sits in their Semantic Web Platform; all of the APIs for querying and managing data are at your disposal, free of charge.
Sir Tim Berners-Lee recently called on the holders of data to loosen their grip, demanding Raw Data, Now! For those prepared to take that step, and unsure where to go next, offers such as those from Talis and Amazon are certainly worth a very close look.
Disclaimer: Talis is my former employer
March 25th, 2009
Jeff Pollock discusses his new book, The Semantic Web for Dummies
Oracle’s Jeff Pollock has been involved with Semantic Technologies for more than a decade, and puts that experience to good use in his latest book, Semantic Web for Dummies.
I spoke with Jeff yesterday, and the result has just been released as a podcast.
Have a listen to hear Jeff’s intentions for the book and some of his wider views on the ‘failure of Natural Language Processing,’ the Semantic Web label ‘not speaking to a problem… or a solution,’ and the ways in which Semantic Technologies are quietly being put to work in the enterprise.
January 14th, 2009
Thomson Reuters bets on Content remaining King with Calais 4.0
Global information behemoth Thomson Reuters today announces the latest version of its Calais web service, delivering on earlier promises with respect to ‘Linked Data’ and firmly staking out the company’s intention to be a significant player in the shifting market for timely and authoritative information.
I’ll take a more in-depth look at the importance of authoritative sources in the emerging Linked Data ecosystem in this related post, and concentrate on the specifics of the Calais 4.0 release here.
Thomson Reuters’ Tom Tague describes version 4.0 as
“a fundamental change to the underlying service; it’s basically a new service”
This re-engineering of Calais will deliver the functionality that users have come to rely upon, whilst ensuring Thomson Reuters’ ability to continue to scale in a timely and cost-effective manner on the back of Amazon’s Web Services offering.
Tague describes the service released today as a technology preview to run alongside the existing Calais service for a period, but he is confident that it is at production strength from Day 1. Developers, Tague suggested, would
“try it and stay.”
In addition to this strengthening of the core offering, Calais 4.0 includes five substantive developments.
First, the company has followed through on earlier talk about ‘Linked Data,’ ensuring that any of around 25 entity types (company names, geographic areas, album titles, etc) discovered in content submitted to Calais will now be returned to the submitter with a ‘dereferenceable URI’ that may be followed by either people or software in order to discover further information. The URI resolves to a Calais-hosted page of RDF with pointers to the Linked Data community’s usual suspects; DBpedia, MusicBrainz, GeoNames, the CIA Factbook, etc.
More unusually, and importantly, the second development sees the document include pointers to Thomson Reuters’ own content such as the (current) stock ticker, Board membership data, etc.
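For readers wondering what ‘following’ a dereferenceable URI looks like from software, the usual Linked Data pattern is an ordinary HTTP GET that asks for RDF rather than HTML. The sketch below only builds such a request using Python’s standard library; the entity URI is a placeholder I have made up, not a real Calais identifier, and the media type is my assumption, since I have not confirmed which RDF serialisations Calais returns.

```python
from urllib.request import Request

# Placeholder URI for illustration only; not a real Calais entity identifier.
EXAMPLE_ENTITY_URI = "http://example.org/entity/acme-corp"


def build_rdf_request(uri, media_type="application/rdf+xml"):
    """Prepare a GET request that asks the server for an RDF representation.

    The Accept header is how a client signals that it wants data rather
    than a human-readable page. The default media type is an assumption.
    """
    return Request(uri, headers={"Accept": media_type})


if __name__ == "__main__":
    request = build_rdf_request(EXAMPLE_ENTITY_URI)
    print(request.get_method(), request.full_url)
    print("Accept:", request.get_header("Accept"))
    # Passing the request to urllib.request.urlopen would actually send it;
    # that step is left out because the URI above does not exist.
```

The RDF that comes back would then be parsed for links to other sources, and each of those links can be followed in exactly the same way.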
As the Press Release notes,
“In keeping with its commitment to the Linked Data standard, Thomson Reuters has also made a subset of its core data assets available for public use on the Web. The collection of business information represents the first contribution to the ‘Linked Data cloud’ made by a major publisher. It enables developers to programmatically query and use fundamental facts on hundreds of thousands of publically-traded companies, including company descriptions, stock tickers, management teams, locations, boards of directors and more.”
Thirdly, Calais 4.0 includes a ‘metadata transport layer’ to simplify the process of exposing and sharing large bodies of semantically rich data. Tague suggested that 200 to 300 million persistent and dereferenceable URIs are available today (and capable of servicing tens or hundreds of millions of hits per day) for content previously submitted to Calais, with many more to come as the service scales.
Fourth, Calais is making its first move beyond English language content, and version 4.0 now supports entity extraction in French. French-language relationship and event extraction will follow shortly, as will other languages. Tague suggested that Hebrew, Arabic and Chinese will be amongst those rolled out during 2009. Behind the scenes, the team are also experimenting with automated translation services, which Tague reports to be ‘working very well’ in the lab.
Fifth, and finally, the Calais team is publishing an RDFS version of their schema, giving developers far more flexibility as to the ways in which they integrate the Calais web service into their own applications.
All in all, a welcome set of incremental improvements to Calais that also serves to raise an interesting set of questions about the role of ‘professional’ data in the Linked Data ecosystem.
Thomson Reuters’ Tom Tague is a regular member of the Semantic Web Gang, and should be discussing the release of Calais 4.0 in more depth on this month’s show, due to be recorded on 15 January.
December 16th, 2008
Could Amazon provide a home to Linked Data?
In a press release issued earlier this month Amazon announced their ‘Public Data Sets on Amazon Web Services’ initiative, providing a free home to potentially massive public data sets and free use of those data by developers hosting their applications in the company’s data centres.
Larry Dignan covered the story for ZDNet at the time, but the Semantic Web angle arose in mailing list-based discussion amongst members of the Linking Open Data community project, which is supported by the World Wide Web Consortium.
Kingsley Idehen (with whom I recorded a podcast earlier this year) began the thread, writing;
“Please see: http://aws.amazon.com/publicdatasets/ ; potentially the final destination of all published RDF archives from the [Linked Open Data] cloud.
…
Once the data sets are available from Amazon, database constructions costs will be significantly alleviated.
We have DBpedia reconstruction down to 1.5 hrs (or less) based on Virtuoso’s in-built integration with Amazon S3 for backup and restoration etc.. We could get the reconstruction of the entire LOD cloud down to some interesting numbers once all the data is situated in an Amazon data center.”
(my links)
As I note in a blog post here, data bundled up inside the ‘Elastic Block Stores’ that Amazon offers aren’t fully-fledged participants in the open data web, but developers already comfortable with Amazon’s Web Services certainly do gain free and easy access to incredibly large bodies of data.
To have the data sets already collected by the Linked Open Data community easily available to the (different) community of Amazon Web Services developers would go a long way toward educating them about Linked Data and its potential… even if the resulting applications aren’t necessarily reliant upon Amazon infrastructure.
December 9th, 2008
Zemanta talks Linked Data with SDK and commercial API
I covered Slovene semantic technology startup Zemanta back in September when they secured investment from New York City’s Union Square Ventures, and the company also received frequent mentions in the Semantic Web Gang’s recent look back over 2008.
Yesterday, the company released an update to their popular WordPress plug-in and today they announced [PDF] commercial availability of their ‘Semantic API.’
The company describes the API, suggesting that;
“We analyze your post through our proprietary natural language processing and semantic algorithms, and statistically compare its contextual framework to our preindexed database of content.
We are using a combination of machine learning techniques and end-user input from our widget users, that enables us to train the engine and constantly improve the recommendations.”
Users familiar with the blog plug-in will recognise - and probably value - these capabilities, which the API makes available for use in other situations.
Superficially, there are clear similarities with the capabilities of services such as Thomson Reuters’ Open Calais, which also permits third parties to pass data via an API and receive structured and enriched results in return. A news article discussing a merger, for example, might be returned marked up with structured information on the companies involved, their key personnel, etc.
Given the backgrounds of Zemanta and Thomson Reuters, and the different data sets upon which they draw, it’s likely that a quite clear distinction will emerge in the use cases for which each is appropriate. It appears likely that Zemanta is more suited to the informal Web (pulling content from IMDb, Twitter and the like) whilst Calais will excel in mission-critical applications at the Fortune 500 and their ilk. Both add value in the mid-range, and only time will tell which is preferred moving forward.
Interestingly, both are making moves to embrace the Semantic Web’s Linking Open Data movement, which I’ve covered frequently here. Calais made announcements in that direction back in September, and an upcoming release of their service will make good on that. Zemanta’s press release today states;
“Zemanta fully supports the Linking Open Data initiative. It is the first API that returns disambiguated entities linked to dbPedia, Freebase, MusicBrainz, and Semantic Crunchbase. The data can be returned in the standard format of Semantic web – RDF. It is an ideal gateway from unstructured web to semantic web. This represents a major step ahead for efforts to connect the Web into a semantic web of objects.”
Zemanta CTO, Andraž Tori, commented;
“I see it as a stargate portal from unstructured content into the world of Semantic Web.”
Zemanta has already signed up a number of partners, and one of those is Freebase. Jamie Taylor (who recorded an early podcast about Freebase here) commented on the way that end users might benefit from accessing Freebase data via Zemanta;
“For publishers, the Zemanta API acts as a front door to the universe of open data on the web, facilitating the jump from unstructured text to semantic entities. You can take plain text, use the Zemanta API to resolve that text into strongly identified entities, and then query Freebase for detailed information about the mentioned people, places, movies, etc. Truly empowering.”
Use of the API is free for up to 10,000 API calls per month, with a subscription fee above that level.
December 8th, 2008
Mark Greaves of Vulcan sees business opportunities in the Semantic Web
Vulcan shares many traits with its reclusive founder, Paul Allen, yet behind the scenes the company is responsible for philanthropic support to research and community-building activities, as well as investing commercially in the likes of Radar Networks (the company behind Twine) and Evri.
Last week, I had the opportunity to talk with Mark Greaves, Vulcan’s Director of Knowledge Systems Research, and the resulting podcast was released earlier today.
Drawing upon a background that includes the likes of Boeing and DARPA, Greaves is persuaded of the benefits to be found in applying semantic technologies to existing business problems and processes.
Greaves identifies four broad areas ripe for development;
- Search
- Enterprise Information
- Social Semantic Web Applications
- Web-scale Knowledge Publishing
It will be interesting to see the extent to which Vulcan - and others - invest in these areas next year.
Paul Miller provides consultancy and analysis services at the interface between the worlds of Cloud Computing and the Semantic Web. See his full profile and disclosure of his industry affiliations.