March 16, 2016 communication messaging

Fleep, The Team Messaging App Built And Funded By Ex-Skypers, Flicks Monetization Switch

Source

Fleep, the team messaging app built and backed by a number of ex-Skype engineers, is flicking the monetisation switch today. A year after launching as a free public beta, the Estonian startup is introducing a freemium revenue model that sees users on its paid tier — €3 per month per user — get access to unlimited message history and files, while free users can only access messages from the last 30 days.

That cut-off point, says Fleep co-founder and CEO Henn Ruukel, means Premium customers are still able to communicate with non-core team members or external partners on an ad-hoc and free basis, keeping the service as a viable alternative to email.

“If we would have chosen a paid-only model it would limit usage and people would fall back to email, while always-free model would eventually end in indirect monetization through ads or something ugly,” he tells me in a Fleep chat. “I think we were able to draw the line between Free and Premium so it feels fair and is easy to understand.”

In addition to unlimited message and file history, Fleep will soon add “advanced management features” for subscribers to its paid-for Premium service, including team management and administered chats — giving company admins the ability to add and remove users from Fleep chats. That’s no doubt a much-requested enterprise feature, despite Fleep’s positioning as a team messaging app that retains the ‘openness’ of email and its ad-hoc collaborative nature, coupled with the advantage of being a modern messaging platform, including better search, organisational tools and file management. After all, companies still need to maintain a high level of control over company communication, for competitive, political and compliance purposes.

A quick recap of how Fleep works and what makes it different from other team messaging apps, including upstarts such as Cotap, a messaging startup co-founded by two ex-Yammer executives, IMbox.me, backed by ex-Nokia President and CEO Olli-Pekka Kallasvuo, and TigerText, along with the likes of Microsoft-owned Yammer, Convo, Slack, and HipChat, all of which broadly play in the same space:

To start a new conversation in Fleep, you click on the ‘create new’ button and enter the names of those who you want to become part of the conversation. If they aren’t already using the app, you can enter their email address instead, and they’ll be able to interface with the conversation via email by hitting reply, although in this instance their contribution also gets pulled into Fleep. In other words, not all participants need to be using the app.

To that end, I asked Ruukel why the decision was made to introduce paid tiers now. “Mainly in order to provide clarity to our users,” he says. “Since March when we launched Fleep apps, many have asked how much Fleep will cost, as this is one of the aspects to consider when selecting tools for the team.”

With today’s newly-introduced freemium model, Fleep is finally providing that clarity.

February 26, 2016 database

Inside Libpostal - a fast, multilingual, international street address parser trained on OpenStreetMap data · Mapzen

Source

For the past year, data scientist Al Barrentine has been working with Mapzen to crack one of the hardest problems in geocoding and place search: international address parsing. It’s resulted in Libpostal, a state-of-the-art, lightning-fast C library and statistical model for parsing and normalizing addresses around the world. The address parser alone is 98.9% accurate. And by virtue of being written in C, libpostal can be used directly from several popular languages, with bindings already written for Python, Go, Ruby, Java, and NodeJS.

The world is a big place, but Libpostal is a big step toward making it easier to find any place anywhere (and it only uses open data). We at Mapzen are incredibly excited to soon be using Libpostal as a key part of Mapzen Search and we can’t wait to see what you use it for!

Here, Al explains just how Libpostal came to be and, importantly, shares how it works so others can benefit from what he learned.


Street addresses are among the more quirky artifacts of human language, yet they are crucial to the increasing number of applications involving maps and location. Last year I worked on a collaboration with Mapzen with the goal of building smarter, more international geocoders using the vast amounts of local knowledge in open geographic data sets.

The result is libpostal: a multilingual street address parsing/normalization library, written in C, that can handle addresses all over the world.

Libpostal uses machine learning and is informed by tens of millions of real-world addresses from OpenStreetMap. The entire pipeline for training the models is open source. Since OSM is a dynamic data set with thousands of contributors and the models are retrained periodically, improving them can be as easy as contributing addresses to OSM.

Each country’s addressing system has its own set of conventions and peculiarities and libpostal is designed to deal with practically all of them. It currently supports normalizations in 60 languages and can parse addresses in more than 100 countries. Geocoding using libpostal as a preprocessing step becomes drastically simpler and more consistent internationally.

The core library is written in pure C, which means that in addition to having a small carbon footprint, libpostal can be used from almost any stack or programming language. There are currently bindings written for Python, Go, Ruby, Java, and NodeJS with more popular languages coming soon.

But let’s rewind for a moment.

Why we care about addresses

Addresses are the unique identifiers humans use to describe places, and are at the heart of virtually every facet of modern Internet-connected life: map search, routing/directions, shipping, on-demand transportation, delivery services, travel and accommodations, event ticketing, venue ratings/reviews, etc. There’s a $1B company in almost every one of those categories.

The central information retrieval problem when working with addresses is known as geocoding. We want to transform the natural language addresses that people use to describe places into lat/lon coordinates that all our awesome mapping and routing software uses.

Geocoding’s not your average document search. Addresses are typically very short strings, highly ambiguous, and chock full of abbreviations and local context. There is usually only one correct answer to a query from the user’s perspective (with the exception of broader searches like “restaurants in Fort Greene, Brooklyn”). In some instances we may not even have the luxury of user input at all, e.g. batch geocoding a bunch of addresses obtained from a CSV file, the Web or a third-party API.

Despite these idiosyncrasies, we tend to use the same full-text search engines for addresses as we do for querying traditional text documents. Out of the box, said search engines are terrible at indexing addresses. It’s easy to see how a naïve implementation could pull up addresses on St Marks Ave when the query was “St Marks Pl” (both the words “Ave” and “Pl” have a low inverse document frequency and do not affect the rank much). Autocomplete might yield addresses on the 300 block of Main Street for a query of “30 Main Street”. Abbreviations like “Saint” and “St”, which are not simple prefix overlaps, might not match in most spellcheckers since their edit distance is greater than 2.

Typically we employ all sorts of heuristics to help with address matching: synonyms lists, special tokenizers, analyzers, regexes, simple parsers, etc. Most of these methods require changing the search engine’s config, and make US/English-centric, overly-simplified assumptions. Even using a full-text search engine in general won’t help in the server-side batch geocoding case unless we’re fully confident that the first result is the correct one.

Geocoding in 2016

Libpostal began with the idea that geocoding is more similar to the problem of record linkage than text search.

The question we want to be able to answer is: “are two addresses referring to the same place?” Having done that, we can simultaneously make automated decisions in the batch setting and return more relevant results in user-facing geocoders.

This decomposes into two sub-problems:

  1. Normalization: the easiest way to handle all the abbreviated variations and ambiguities in addresses is to produce canonical strings suitable for machine comparison, i.e. make “30 W 26th St” equal to “Thirty West Twenty-Sixth Street”, and do it in every language.
  2. Parsing: some components of an address are more essential than others, like house numbers, venue names, street names, and postal codes. Beyond that, addresses are highly structured and there are multiple redundant ways of specifying/qualifying them. “London, England” and “London, United Kingdom” specify the same location if parsed to mean city/admin1 and city/country respectively. If we already know London, there would be no point in returning addresses in Manchester simply because it’s also in the UK.

Once we’ve got canonical address strings segmented into components, geocoding becomes a much simpler string matching problem, the kind that full-text search engines and even relational/non-relational databases are good at handling. With a little finesse one could conceivably geocode with nothing but libpostal and a hash table.

To see how that’s possible, the next two sections describe in detail how libpostal addresses (pun very much intended) the normalization and parsing problems respectively.

Multilingual address normalization

Normalization is the process of converting free-form address strings encountered in the wild into clean normalized forms suitable for machine comparison. This is primarily deterministic/rule-based.

Address normalization using libpostal’s Python bindings
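
The caption above refers to a demo that doesn’t reproduce here, so here is roughly the same thing as a minimal sketch, assuming the Python bindings (the postal package) are installed; the exact output depends on the libpostal data files on your machine.

    from postal.expand import expand_address

    # Each call returns a list of lowercased, normalized expansions of the input.
    for expansion in expand_address("Quatre-vingt-douze Ave des Champs-Élysées"):
        print(expansion)
    # Depending on the installed data files, the output includes strings like
    # "92 avenue des champs-elysees" alongside other valid expansions.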

There are several steps involved in making normalization work across so many different languages. I’ll mention the notable ones.

Multilingual tokenization

Tokenization is the process of segmenting text into words and symbols. It is the first step in most NLP applications, and there are many nuances. The tokenizer in libpostal is actually a lexer implementing the Unicode Consortium’s TR-29 spec for unicode word segmentation. This method handles every script/alphabet, including ideograms (used in languages not separated by whitespace e.g. Chinese, Japanese, Korean), which are read one character at a time.

The tokenizer is inspired by the approach in Stanford’s CoreNLP, i.e. write down a bunch of regular expressions and compile them into a fast DFA. We use re2c, a light-weight scanner generator which often produces C that’s as fast as a handwritten equivalent. Indeed, tokenization is quite fast, chunking through > 2 million tokens per second.
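
To make the “regexes compiled into a scanner” idea concrete, here is a deliberately simplified toy tokenizer in Python. It is nowhere near TR-29 compliant and is not how libpostal actually tokenizes (that is generated C via re2c); it only shows the shape of the approach.

    import re

    # A toy stand-in for the generated lexer: each alternative plays the role of
    # one scanner rule. The real libpostal tokenizer is generated C implementing
    # Unicode TR-29; this sketch only shows the shape of the approach.
    TOKEN_RE = re.compile(
        r"""
          (?P<NUMERIC>\d+\w*)                     # 30, 26th, 11217
        | (?P<WORD>[^\W\d_]+(?:['’][^\W\d_]+)*)   # words, incl. apostrophes
        | (?P<PUNCT>[^\w\s])                      # single punctuation characters
        """,
        re.UNICODE | re.VERBOSE,
    )

    def tokenize(text):
        for match in TOKEN_RE.finditer(text):
            yield match.lastgroup, match.group()

    print(list(tokenize("30 W 26th St, New York")))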

Abbreviation expansion

Almost every language on Earth uses abbreviations in addresses. Historically this had to do with width constraints on things like street signs or postal envelopes. Digital addresses face similar constraints, namely that they are more likely than other types of text to be viewed on a mobile device.

Abbreviations create ambiguity, as there are multiple ways of writing the same address with different degrees of verbosity: “W St Johns St”, “W Saint Johns St”, “W St Johns Street”, and “West Saint Johns Street” are all equivalent. There are similar patterns in most languages.

For expanding abbreviations to their canonical forms, libpostal contains a number of per-language dictionaries, which are simple text files mapping “Rd” to “Road” in 60 languages. Each word/abbreviation can have one or more canonical forms (“St” can expand to “Street” or “Saint” in English), and one or more dictionary types: directionals, street suffixes, honorifics, venue types, etc.

Dictionary types make it possible to control which expansions are used, say if the input address is already separated into discrete fields, or if using libpostal’s address parser to the same effect. With dictionary types, it’s possible to apply only the relevant expansions to each component. For instance, in an English address, “St.” always means “Saint” when used in a city or country name like “St. Louis” or “St. Lucia” and will only be ambiguous when used as part of a street or venue/building name.

The dictionaries are compiled into a trie data structure, at which point a fast search algorithm is used to scan through the string and pull out matching phrases, even if they span multiple words (e.g. “State Route”). This type of search also allows us to treat multi-word phrases as single tokens during address parsing.
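
As a rough illustration of that dictionary-driven phrase matching (a sketch of the idea, not libpostal’s actual C trie), the following compiles a tiny, made-up phrase list into a nested-dict trie and greedily pulls out the longest match at each token position.

    # Sketch of dictionary-driven phrase matching: compile phrases into a trie of
    # nested dicts and greedily take the longest match at each token position.
    PHRASES = {
        ("state", "route"): ["state route"],
        ("st",): ["street", "saint"],   # ambiguous: both canonical forms are kept
        ("rd",): ["road"],
    }

    def build_trie(phrases):
        trie = {}
        for words, canonical in phrases.items():
            node = trie
            for word in words:
                node = node.setdefault(word, {})
            node[None] = canonical          # None marks "a phrase ends here"
        return trie

    def expand_tokens(tokens, trie):
        i, out = 0, []
        while i < len(tokens):
            node, j, longest = trie, i, None
            while j < len(tokens) and tokens[j] in node:
                node = node[tokens[j]]
                j += 1
                if None in node:
                    longest = (j, node[None])   # longest phrase seen so far
            if longest:
                i, canonical_forms = longest
                out.append(canonical_forms)
            else:
                out.append([tokens[i]])
                i += 1
        return out

    print(expand_tokens("old state route 9 rd".split(), build_trie(PHRASES)))
    # [['old'], ['state route'], ['9'], ['road']]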

Ideographic languages like Japanese and Korean are handled correctly, even though the extracted phrases are not surrounded by whitespace. So are Germanic languages where street suffixes are often appended onto the end of the street name, but may optionally be separated out (Rosenstraße and Rosen Straße are equivalent). All of the abbreviations listed on the OSM Name Finder wiki are implemented as of this writing, plus many more.

At the moment, libpostal does not attempt to resolve ambiguities in addresses, and often produces multiple potential expansions. Some may be nonsensical (“Main St” expands to both “Main Street” and “Main Saint”), but the correct form will be among them. The outputs of libpostal’s expand_address can be treated as a set, and address matching can be seen as doing a set intersection, or a JOIN in SQL parlance. In the search setting, one should index all of the strings produced, and use the same code to normalize user queries before sending them to the search server/database.
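
Concretely, that set-intersection view of matching might look something like the sketch below, assuming the postal Python bindings are installed. The helpers and the two sample records are made up for illustration; a real system would put the expansions into a search index or database rather than an in-memory dict.

    from collections import defaultdict
    from postal.expand import expand_address

    def addresses_match(a, b):
        # Two addresses "match" if their sets of expansions intersect.
        return bool(set(expand_address(a)) & set(expand_address(b)))

    # The same idea as an index: map every expansion back to the record it came from.
    def build_index(records):
        index = defaultdict(set)
        for record_id, address in records:
            for expansion in expand_address(address):
                index[expansion].add(record_id)
        return index

    def lookup(index, query):
        hits = set()
        for expansion in expand_address(query):
            hits |= index.get(expansion, set())
        return hits

    records = [(1, "Thirty West Twenty-Sixth Street, New York"),
               (2, "30 Lafayette Ave, Brooklyn NY 11217")]
    index = build_index(records)
    # Per the article's example the expansions of these two spellings should
    # overlap, though the exact strings depend on the installed data files.
    print(lookup(index, "30 W 26th St, New York"))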

Future iterations of expand_address will probably use OpenStreetMap (where abbreviation is discouraged) to build classifiers for ambiguous expansions, and include an option for outputs to be ranked by likelihood. This should help folks who need a single “best” expansion, e.g. when displaying the results on a map.

Address language classification

Abbreviations are language-specific. Consider expanding the token “St.” in an address of unknown language. The canonical form would be “Sankt” in German, “Saint” in French, “Santo” in Portuguese, and so on.

We don’t actually want to list all of these permutations. In most user-facing geocoders, we likely know the language ahead of time (say from the user’s HTTP headers or current location). However, in batch geocoding, we don’t know the language of any of our input addresses, so we’ll need a classifier to predict languages automatically using only the input text.

Language detection is a well-studied problem and there are several existing implementations (such as Chromium’s compact language detector) which achieve very good results on longer text documents such as Wikipedia articles or webpages. Unfortunately, because of some of the aforementioned differences between addresses and other forms of text, packages like CLD which are trained on webpages usually expect more/longer words than we have in an abbreviated address, and will often get the language wrong or fail to produce a result at all.

So we’ll need to build our own language classifier and train it specifically for address data. This is a supervised learning problem, which means we’ll need a bunch of address-related input labeled by language, like this:

  de  Graf-Folke-Bernadotte-Straße
  sv  Tollare Träskväg
  nl  Johannes Vermeerstraat Akersloot
  it  Strada Provinciale Ca' La Cisterna
  da  Østervang  Vissenbjerg
  nb  Lyngtangen Egersund
  en  Wood Point Road
  ru  улица Солунина
  ar  جادة صائب سلام
  fr  Rue De Longpré
  he  השלום
  ms  Jalan Sri Perkasa
  cs  Jeřabinová  Rokycany
  ja  山口秋穂線
  ca  Avinguda Catalunya
  es  calle Camilo Flammarión
  eu  Mungialde etorbidea
  pt  Rua Pedro Muler Faria

Sounds great, but where are we going to find such a data set? In libpostal, the answer to that question is almost always: use OpenStreetMap.

OSM has a great system when it comes to languages. By default the name of a place is the official local language name, rather than the Anglicized/Latinized name. Beijing’s default name, for instance, is 北京市 rather than “Beijing” or “Peking.”

Some addresses in OSM are explicitly labeled by language, especially in countries with multiple official street sign languages like Hong Kong, Belgium, Algeria, Israel, etc. In cases where a single name is used, we build an R-tree polygon index that can answer the question: for a given lat/lon, which official and/or regional language(s) should I expect to see? In Germany we expect addresses to be in German. In some regions of Spain, Catalan or Basque or Galician will be returned as the primary language we expect to see on street signs, whereas (Castilian) Spanish is used as a secondary alternative. In cases where languages are equally likely to appear, the language dictionaries in libpostal are used to help disambiguate. Lastly, street signs are not always written in the languages spoken by the majority of people, a vestige of linguistic imperialism, and the language index accounts for this as well.

All said and done, this process produces around 80 million language-labeled address strings. From there we extract features (informative attributes of the input which help to predict the output) similar to those used in Chromium and the language detection literature: sequences of 4 letters or 1 ideogram, whole tokens for words shorter than 4 characters, and a shortcut for unicode scripts mapping to a single language like Greek or Hebrew. Specific to our use case, we also include entire phrases matching certain language dictionaries from libpostal.
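
A rough sketch of that feature extraction is below. The feature names are made up, and the script shortcut only handles Greek and Hebrew, so this is far cruder than the real thing.

    import unicodedata

    # Scripts that, for this toy sketch, map to a single language.
    SINGLE_LANGUAGE_SCRIPTS = {"GREEK": "el", "HEBREW": "he"}

    def extract_features(address):
        features = []
        for token in address.lower().split():
            # Script shortcut: a Greek or Hebrew character decides the language.
            for ch in token:
                char_name = unicodedata.name(ch, "")
                for script in SINGLE_LANGUAGE_SCRIPTS:
                    if char_name.startswith(script):
                        features.append("script=" + script)
            if len(token) < 4:
                features.append("word=" + token)        # whole short tokens
            else:
                for i in range(len(token) - 3):          # 4-letter sequences
                    features.append("4gram=" + token[i:i + 4])
        return sorted(set(features))

    print(extract_features("Tollare Träskväg"))
    print(extract_features("Rue De Longpré"))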

We then train a multinomial logistic regression model (also known as softmax regression) using stochastic gradient descent and a few sparsity tricks to keep training times reasonably fast. Logistic regression is heavily used in NLP because unlike Naïve Bayes, it does not make the assumption that input features are independent, which is unrealistic in language.

Another nice property of logistic regression is that its output is a well-calibrated probability distribution over the labels, not just normalized scores that look like “probabilities if you close one eye and squint with the other.” With real probabilities we can implement meaningful decision boundaries. For instance, if the top language returned by the classifier has a probability of 0.99, we can safely ignore the other language dictionaries, whereas if it makes a less confident prediction like 0.62 French and 0.33 Dutch, we might want to throw in both dictionaries. Though the latter type of output should not be interpreted as the distribution of languages in the address itself (as in a multi-label classifier), results with multiple high-probability languages are most often returned in cases like Brussels where addresses actually are written in two languages.
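
Libpostal’s classifier is custom C, but the setup described here (sparse hashed features, logistic loss, SGD, probability thresholds on the output) can be approximated with scikit-learn for anyone who wants to experiment. The sketch below is an analogue, not the actual libpostal trainer: SGDClassifier fits one-vs-rest rather than a true multinomial model, the six training samples are toy data, and the 0.2 fallback threshold is made up.

    from sklearn.feature_extraction import FeatureHasher
    from sklearn.linear_model import SGDClassifier

    # Toy training data; the real model is trained on ~80 million OSM-derived examples.
    samples = [("Rue De Longpré", "fr"), ("Straße des 17. Juni", "de"),
               ("Wood Point Road", "en"), ("Avinguda Catalunya", "ca"),
               ("Johannes Vermeerstraat", "nl"), ("Calle Camilo Flammarión", "es")]

    def featurize(address):
        # Character 4-grams plus whole short tokens, as described above.
        feats = []
        for tok in address.lower().split():
            if len(tok) < 4:
                feats.append("w=" + tok)
            else:
                feats.extend("g=" + tok[i:i + 4] for i in range(len(tok) - 3))
        return feats

    hasher = FeatureHasher(input_type="string")
    X = hasher.transform(featurize(address) for address, _ in samples)
    y = [language for _, language in samples]

    # Logistic loss + SGD: roughly "logistic regression trained with stochastic
    # gradient descent" (the loss is spelled "log" on older scikit-learn versions).
    clf = SGDClassifier(loss="log_loss", max_iter=1000).fit(X, y)

    def candidate_languages(address, confident=0.99):
        proba = clf.predict_proba(hasher.transform([featurize(address)]))[0]
        ranked = sorted(zip(clf.classes_, proba), key=lambda pair: -pair[1])
        top_language, top_p = ranked[0]
        if top_p >= confident:
            return [top_language]     # confident: use one language's dictionaries
        # Otherwise keep every reasonably probable language (e.g. fr + nl in Brussels).
        return [language for language, p in ranked if p >= 0.2]

    print(candidate_languages("Rue Pedro Muler Faria"))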

Numeric expression parsing

In many addresses, particularly on the Upper East Side of Manhattan it seems, numbers are written out as words, e.g. “Eighty-sixth Street” instead of “86th Street.” Libpostal uses a simplified form of the Rule-based Number Format (RBNF) in CLDR which spells out the grammatical rules for parsing/spelling numbers in various languages.

Rather than try to exhaustively list all numbers and ordinals that might be used in an address, we supply a handful of rules which the system can then use to parse arbitrary numbers.

In English, when we see the word “hundred”, we multiply by any number smaller than 100 to the left and add any number smaller than 100 to the right. There’s a recursive structure there. If we know the rule for the hundreds place, and we know how to parse all numbers smaller than 100, then we can “count” up to 1000.
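
To make the recursive structure concrete, here is a toy English-only parser in the spirit of those rules. It is nothing like libpostal’s actual implementation, which is data-driven and multilingual, but it shows the multiply-left/add-right rule for “hundred” in action.

    # A toy English-only number-word parser illustrating the recursive "hundred"
    # rule: multiply what accumulated to the left, add what follows on the right.
    UNITS = {w: i for i, w in enumerate(
        "zero one two three four five six seven eight nine ten eleven twelve "
        "thirteen fourteen fifteen sixteen seventeen eighteen nineteen".split())}
    TENS = {w: 10 * i for i, w in enumerate(
        "twenty thirty forty fifty sixty seventy eighty ninety".split(), start=2)}
    ORDINALS = {"first": "one", "second": "two", "third": "three", "fifth": "five",
                "eighth": "eight", "ninth": "nine", "twelfth": "twelve"}

    def parse_number_words(text):
        total = 0
        for word in text.lower().replace("-", " ").split():
            word = ORDINALS.get(word, word)
            if word.endswith("ieth"):            # "thirtieth" -> "thirty"
                word = word[:-4] + "y"
            elif word.endswith("th"):            # "sixth" -> "six"
                word = word[:-2]
            if word in UNITS:
                total += UNITS[word]
            elif word in TENS:
                total += TENS[word]
            elif word == "hundred":
                total = (total or 1) * 100       # multiply everything to the left
        return total

    print(parse_number_words("Eighty-sixth"))             # 86
    print(parse_number_words("one hundred twenty-five"))  # 125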

Numeric spellings can get reasonably complicated in other languages. French for instance uses some Celtic-style numbers which switch to base 20, so “quatre-vingt-douze” (“four twenties twelve”) = 92. Italian numbers rarely contain spaces, so “milleottocentodue” = 1802. In Russian, ordinal numbers can have 3 genders. Libpostal parses them all, currently supporting numeric expressions in over 30 languages.

Roman numerals can be optionally recognized in any language (so IX normalizes to 9), though they’re most commonly found in Europe in the titles of popes, monarchs, etc. In most cases Roman numerals are the canonical form, and can be ambiguous with other tokens (a single “I” or “V” could also be a person’s middle initial), so a version of the string with unnormalized Roman numerals is added as well.

Transliteration

Many addresses around the world are written in non-Latin scripts such as Greek, Hebrew, Cyrillic, Han, etc. In these cases, addresses can be written in the local alphabet or transliterated, i.e. converted to a Latin script equivalent. Because the target script is usually Latin, transliteration is also sometimes known as “Romanization.”

For example, “Тверская улица” in Moscow transliterates to “Tverskaya ulitsa.” A restaurant website would probably use the former for its Russian site and the latter for its international site. Street signs in many countries (especially those who’ve at some point hosted a World Cup) will typically list both versions, at least in major cities.

Libpostal takes advantage of all the transliterators available in the Unicode Consortium’s Common Locale Data Repository (CLDR), again compiling them to a trie for fast runtime performance. The implementation is lighter weight than having to pull in ICU, which is a huge dependency and may conflict with system versions.

Each script or script/language combination can use one or more different transliterators. There are for instance several differing standards for transliterating Greek or Hebrew, and libpostal will try them all.

There’s also a simpler transliterator, the Latin to ASCII transform, which converts “œ” to “oe”, etc. This is in addition to standard Unicode normalization, which would decompose “ç” into “c” and “COMBINING CEDILLA (U+0327)”, and optionally strip the diacritical mark to make it just “c.” Accent stripping is sort of an “ignorant American” type of normalization, and can change the pronunciation or meaning of words. Still, sometimes addresses have to be written in an ASCII approximation (because keyboards), especially with travel-related searching, so we do strip accent marks by default, with an optional flag to prevent it.
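
The standard-Unicode part of that pipeline is easy to reproduce with Python’s built-in unicodedata module. The sketch below decomposes to NFD and drops combining marks, which is exactly the accent stripping described above; libpostal itself does this in C, with the Latin-to-ASCII transliterator layered on top for cases plain decomposition can’t handle.

    import unicodedata

    def strip_accents(text):
        # NFD decomposition splits "ç" into "c" + COMBINING CEDILLA (U+0327);
        # dropping the combining marks leaves the bare letter behind.
        decomposed = unicodedata.normalize("NFD", text)
        return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

    print(strip_accents("Champs-Élysées"))    # Champs-Elysees
    # "œ" has no Unicode decomposition, which is why the separate Latin-to-ASCII
    # transform is still needed to turn "œ" into "oe".
    print(strip_accents("œuvre"))             # unchanged: œuvre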

Some countries actually translate addresses into English (something like “Tverskaya Street”), creating further ambiguity. At the cost of potentially adding a few bogus normalizations, libpostal can handle such translations by simply adding English dictionaries as a second language option for certain countries/languages/scripts.

International address parsing

Parsing is the process of segmenting an address into components like house number, street name, city, etc. Though many address parsers have been written over the years, most are rule-based and only designed to handle US addresses. In libpostal we develop the first NLP-based address parser that works well internationally.

Parsing addresses with libpostal’s command-line client
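
For anyone who would rather stay in Python than use the command-line client, the same parser is exposed through the bindings’ parse_address function (assuming the postal package is installed), which returns (value, label) pairs; the exact labels depend on the installed model.

    from postal.parser import parse_address

    for value, label in parse_address(
            "Brooklyn Academy of Music, 30 Lafayette Avenue, Brooklyn, NY 11217"):
        print(label, "->", value)
    # Typical labels include house, house_number, road, city, state and postcode.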

The NLP approach to address parsing

International address parsing is something we could never possibly hope to solve deterministically with something like regex. It might work reasonably well for one country, as addresses tend to be highly structured, but there are simply too many variations and ambiguities to make it work across languages. This sort of problem is where machine learning, particularly in the form of structured learning, really shines.

Most NLP courses/tutorials/libraries focus on models and algorithms, but applications on real-world data sets are not in great abundance. Libpostal provides an example of what an end-to-end production-quality NLP application looks like. I’ll detail the relevant steps of the pipeline below, all of which are open source and published to Github as part of the repository.

Creating labeled data from OSM

OpenStreetMap addresses are already separated into components. Here’s an example of OSM tags as JSON:

{
    "addr:housenumber": "30",
    "addr:postcode": "11217",
    "addr:street": "Lafayette Avenue",
    "name": "Brooklyn Academy of Music"
}

This is exactly the kind of output we want our parser to produce. These addresses are hand-labeled by humans and there are lots of them, more than 50 million at last count.

We want to construct a supervised tagger, meaning we have labeled text at training time, but only unlabeled text (geocoder input) at runtime. The input to a sequence model is a list of tagged tokens. Here’s an example for the address above:

Brooklyn/HOUSE Academy/HOUSE of/HOUSE Music/HOUSE
30/HOUSE_NUMBER Lafayette/ROAD Avenue/ROAD Brooklyn/CITY NY/STATE 11217/POSTCODE

At runtime, we’ll only expect to see “Brooklyn Academy of Music, 30 Lafayette Avenue, Brooklyn, NY 11217”, potentially without the commas. With a little creativity, we can reconstruct the free-text input, and tag each token to produce the above training example.

Notice that the original OSM address has no structure/ordering, so we’ll need to encode that somewhere. For this, we can use OpenCage’s address-formatting repo, which defines address templates for almost every country in the world, with coverage increasing steadily over time. In the US, house number comes before street name (“123 Main Street”), whereas in Germany or Spain it’s the inverse (“Calle Ruiz, 3”). The address templates are designed to format OSM tags into human-readable addresses in every country. This is a good approximation of how we expect geocoder input to look in those countries, which means we have our input strings. I’ve personally contributed a few dozen countries to the repo and it’s getting better coverage all the time.
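
Putting those two pieces together, a stripped-down version of the training-example construction might look like the sketch below. The one-line per-country orderings and the label names are simplified stand-ins for the real address-formatting templates and libpostal’s tag set.

    # Sketch: turn OSM-style tagged components into (token, label) training pairs
    # using a per-country ordering. The "templates" and label names are simplified
    # stand-ins for the address-formatting repo and libpostal's tags.
    TEMPLATES = {
        "us": ["name", "addr:housenumber", "addr:street", "city", "state", "addr:postcode"],
        "de": ["name", "addr:street", "addr:housenumber", "addr:postcode", "city"],
    }
    LABELS = {"name": "HOUSE", "addr:housenumber": "HOUSE_NUMBER",
              "addr:street": "ROAD", "city": "CITY", "state": "STATE",
              "addr:postcode": "POSTCODE"}

    def training_example(osm_tags, country):
        pairs = []
        for key in TEMPLATES[country]:
            value = osm_tags.get(key)
            if value is None:
                continue
            for token in value.split():
                pairs.append((token, LABELS[key]))
        return pairs

    osm = {"addr:housenumber": "30", "addr:postcode": "11217",
           "addr:street": "Lafayette Avenue", "name": "Brooklyn Academy of Music",
           "city": "Brooklyn", "state": "NY"}    # city/state filled in from polygons
    print(training_example(osm, "us"))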

Also notice that in the OSM address, city, state, and country are missing. We can “fill in the blanks” by checking whether the lat/lon of the address is contained in certain administrative polygons. So that we don’t have to look at every polygon on Earth for every lat/lon, we construct an R-tree to quickly check bounding box containment, and then do the slower, more thorough point-in-polygon test on the bounding box matches. The polygons we use are a mix of OSM relations, Quattroshapes/GeoNames localities, and Zetashapes for neighborhoods.
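
The same coarse-filter/exact-check pattern can be sketched with the rtree and shapely Python packages. These are not what libpostal uses internally (its index is custom C), and the two rectangles standing in for admin polygons are made up, but the idea is identical: query the R-tree with the point’s bounding box, then run the exact point-in-polygon test only on the candidates.

    # Sketch of the R-tree + point-in-polygon pattern using the `rtree` and
    # `shapely` packages; the polygons here are crude made-up rectangles.
    from rtree import index
    from shapely.geometry import Point, Polygon

    admin_areas = [
        ("Brooklyn", Polygon([(-74.05, 40.57), (-73.83, 40.57),
                              (-73.83, 40.74), (-74.05, 40.74)])),
        ("Manhattan", Polygon([(-74.02, 40.70), (-73.91, 40.70),
                               (-73.91, 40.88), (-74.02, 40.88)])),
    ]

    idx = index.Index()
    for i, (_, polygon) in enumerate(admin_areas):
        idx.insert(i, polygon.bounds)            # cheap bounding-box index

    def containing_areas(lon, lat):
        point = Point(lon, lat)
        # Coarse filter by bounding box, then the slower exact containment test.
        candidates = idx.intersection((lon, lat, lon, lat))
        return [admin_areas[i][0] for i in candidates
                if admin_areas[i][1].contains(point)]

    print(containing_areas(-73.9771, 40.6863))   # -> ['Brooklyn'] with these boxes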

Making the parser robust

Because geocoders receive a wide variety of queries, we then perturb the address in several ways so the model has to train on many different kinds of input. With certain random probabilities, we use:

  • Alternate names: for some of the admin polygons (e.g. “NYC”, “New York”, “New York City”) so the model sees as many forms as possible
  • Alternate language names: OSM does a great job of handling language in addresses. By default a tag like “name” can be assumed to be in the local official language, or hyphenated if there’s more than one language. Something like “name:en” would be the English version. In countries with multiple official languages like Hong Kong, addresses almost always have per-language tags. We use these whenever possible.
  • Non-standard polygons: like boroughs, counties, districts, neighborhoods, etc. which may be occasionally seen in addresses
  • ISO codes and state abbreviations: so the parser can recognize things like Berlin, DE and Baltimore, MD
  • Component dropout: we usually produce 2–3 different versions of the address with various components removed at random. This way the model also has to learn to parse simple “city, state” queries alongside venue addresses, so it won’t get overconfident, e.g. that the first token in an address is always a venue name. (A minimal sketch of this dropout step follows below.)
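
Here is that minimal dropout sketch; the keep probability, the number of variants, and the component names are all made up for illustration.

    import random

    def dropout_variants(components, keep_probability=0.7, n_variants=3):
        # Produce a few copies of the address with components removed at random,
        # falling back to the full address if everything happens to get dropped,
        # so the parser also learns to handle partial queries like "city, state".
        variants = []
        for _ in range(n_variants):
            kept = {key: value for key, value in components.items()
                    if random.random() < keep_probability}
            variants.append(kept or dict(components))
        return variants

    address = {"house": "Brooklyn Academy of Music", "house_number": "30",
               "road": "Lafayette Avenue", "city": "Brooklyn",
               "state": "NY", "postcode": "11217"}
    for variant in dropout_variants(address):
        print(variant)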

Structured learning

In structured learning, we typically use a linear model to predict the most likely tag for a particular word given some local and contextual attributes or features. What differentiates structured learning from other types of machine learning is that the model’s prediction for the previous word can be used to predict the current word. In similar tasks like part-of-speech tagging or named entity recognition, we typically design “feature functions” which take the following parameters:

  1. The entire sequence of words
  2. The current index in that sequence
  3. The predicted tags for the previous two words

The function then returns a set of features, usually binary, which might help predict the best tag for the given word.

The tag history is what makes sequence learning different from other types of machine learning. Without the tag history, we could come up with the features for each word (even if they use the surrounding words), and use something like logistic regression. In a sequence model, we can actually create features that use the predicted tag of the previous word.

Consider the use of the word “Brooklyn.” In isolation, we could assume it to mean the city, but it could be many other things, e.g. Brooklyn Avenue, The Brooklyn Museum, etc. If we see “Brooklyn” and the last tag was HOUSE_NUMBER, it’s very likely to mean Brooklyn the street name. Similarly, if the last tag was HOUSE (our label for place/building name), it’s likely that we’re inside a venue name, e.g. “The Brooklyn Museum.”

Features

The simplest and most predictive feature is usually the current word itself, but having the entire sequence means there can be bigram/trigram features, etc. This is especially helpful in a case like “Brooklyn Avenue” where knowing that the next word is “Avenue” may disambiguate words used out of their normal context, or help determine that a rare word is a street name. In a French address, knowing that the previous word was “Avenue” is equally helpful, as in “Avenue des Champs-Élysées.”

Training the model for multiple languages entails a few more ambiguities. Take the word “de.” In Spanish it’s a preposition. If we’re lowercasing the training data on the way in, it could also be an abbreviation for Delaware (“wilmington de”) or Deutschland (“berlin de”). Again, knowing the contextual words/tags is quite helpful.

In libpostal, we make heavy use of the multilingual address dictionaries used above in normalization as well as place name dictionaries (aka gazetteers) compiled from GeoNames and OSM. We group known multiword phrases together so e.g. “New York City” will be treated as a single token. For each phrase, we store the set of tags it might refer to (“New York” can be a city or a state), and which one is most likely in the training data. Context features are still necessary though, as many streets take their name from a proper place like “Pennsylvania Avenue,” “Calle Uruguay” or “Via Firenze.”

We also employ a common trick to capture patterns in numbers. Rather than consider each number as a separate word or token, we normalize all digits to an uppercase “D” (since we’re lowercasing, this doesn’t conflict with the letter “d”). This allows us to capture useful patterns in numbers and let them share statistical strength. Some examples might be “DDDDD” or “DDDDD-DDDD”, which are most likely US postal codes. This way we don’t need many training examples of “90210” specifically, we just know it’s a five digit number. GeoNames contains a world postal code data set, which is also used to identify potential valid postal codes. Some countries like South Africa use 4-digit postal codes, which can be confused for house numbers, and the GeoNames postal codes help disambiguate.
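
Pulling the last few sections together, a feature function of the kind described here (current word, neighboring words, the previous predicted tags, digit normalization, known phrases) might look like the hypothetical sketch below; the real libpostal feature set is richer and lives in C.

    import re

    # A hypothetical phrase gazetteer; the real one is compiled from GeoNames/OSM.
    KNOWN_PHRASES = {"new york city": "CITY", "new york": "CITY|STATE"}

    def normalize_token(token):
        # Lowercase and replace every digit with "D": "11217" -> "DDDDD".
        return re.sub(r"\d", "D", token.lower())

    def features(words, i, prev_tag, prev2_tag):
        # Binary features for the word at position i, given the whole sequence
        # and the predicted tags of the previous two words.
        word = normalize_token(words[i])
        feats = {
            "word=" + word,
            "prev_tag=" + prev_tag,
            "prev2_tags=" + prev2_tag + "+" + prev_tag,
            "prev_tag+word=" + prev_tag + "+" + word,
        }
        if i > 0:
            feats.add("prev_word=" + normalize_token(words[i - 1]))
        if i + 1 < len(words):
            feats.add("next_word=" + normalize_token(words[i + 1]))
        bigram = " ".join(words[i:i + 2]).lower()
        if bigram in KNOWN_PHRASES:
            feats.add("phrase_tags=" + KNOWN_PHRASES[bigram])
        return feats

    words = "30 Lafayette Avenue Brooklyn NY 11217".split()
    print(sorted(features(words, 3, "ROAD", "ROAD")))   # features for "Brooklyn"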

The learning algorithm

We use the averaged perceptron popularized by Michael Collins at Columbia, which achieves close to state-of-the-art accuracy while being much faster to train than fancier models like conditional random fields. On smaller training sets, the additional accuracy might be worth slower training times. On > 50M examples, training speed is non-negotiable.

The basic perceptron algorithm uses a simple error-driven learning procedure, meaning if the current weights are predicting the correct answer, they aren’t modified. If the guess is wrong, then for each feature, one is added to the weight of the correct class and one is subtracted from the weight of the predicted/wrong class. The learning is done online, one example at a time. Since the weight updates are very sparse and occur only when the model makes a mistake, training is very fast.

In the averaged perceptron, the final weights are then averaged across all the iterations. Without averaging it’s possible for the basic perceptron to spend so much of its time altering the weights to accommodate the few examples it gets wrong that it produces an unreasonable set of weights that don’t generalize well to new examples (a.k.a. overfitting). In this way, averaging has a similar effect to regularization in other linear models. As in stochastic gradient descent, the training examples are randomly shuffled before each pass, and we make several passes over the entire training set.

Though quite simple, this method is surprisingly competitive in part-of-speech tagging, the existing NLP task that’s closest to address parsing, and has by far the best speed/accuracy ratio of the bunch.
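
The update rule is simple enough to fit in a short sketch. Below is a toy averaged-perceptron tagger (greedy decoding, crude per-pass weight averaging), self-contained and very much a simplification of a real trainer; the tag set, feature function and two training examples are made up.

    import random
    from collections import defaultdict

    def simple_features(words, i, prev_tag):
        word = words[i].lower()
        return ["word=" + word, "prev_tag=" + prev_tag,
                "prev_tag+word=" + prev_tag + "+" + word]

    class AveragedPerceptronTagger:
        def __init__(self, tags):
            self.tags = tags
            self.weights = defaultdict(float)    # (feature, tag) -> weight
            self.totals = defaultdict(float)     # running sums for averaging

        def predict(self, feats):
            scores = {tag: sum(self.weights.get((f, tag), 0.0) for f in feats)
                      for tag in self.tags}
            return max(scores, key=scores.get)

        def update(self, truth, guess, feats):
            if truth == guess:                   # error-driven: no change if correct
                return
            for f in feats:
                self.weights[(f, truth)] += 1.0  # add one to the correct class
                self.weights[(f, guess)] -= 1.0  # subtract one from the wrong guess

        def train(self, examples, epochs=5):
            for _ in range(epochs):
                random.shuffle(examples)         # shuffle before each pass
                for words, tags in examples:
                    prev = "START"
                    for i, truth in enumerate(tags):
                        feats = simple_features(words, i, prev)
                        guess = self.predict(feats)
                        self.update(truth, guess, feats)
                        prev = guess
                # Crude averaging: accumulate one snapshot of the weights per pass.
                # (Real implementations average over every update using timestamps.)
                for key, value in self.weights.items():
                    self.totals[key] += value
            self.weights = defaultdict(
                float, {k: v / epochs for k, v in self.totals.items()})

    examples = [("30 Lafayette Avenue Brooklyn".split(),
                 ["HOUSE_NUMBER", "ROAD", "ROAD", "CITY"]),
                ("10 Main Street Rochester".split(),
                 ["HOUSE_NUMBER", "ROAD", "ROAD", "CITY"])]
    tagger = AveragedPerceptronTagger(["HOUSE_NUMBER", "ROAD", "CITY", "HOUSE"])
    tagger.train(examples)
    print(tagger.predict(simple_features("25 Main Street".split(), 1, "HOUSE_NUMBER")))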

Evaluation

In part-of-speech tagging, simple per-token accuracy is the most intuitive metric for evaluating taggers and is used in most of the literature. For address parsing, since we’ll want to use the parse results downstream as fields in normalization and search, a single mistake changes the JSON we’ll be constructing from the parse. Consider the following mistake:

Brooklyn/HOUSE Academy/HOUSE of/ROAD Music/HOUSE
30/HOUSE_NUMBER Lafayette/ROAD Avenue/ROAD Brooklyn/CITY NY/STATE 11217/POSTCODE

In a full-text search engine like Elasticsearch, it might still work to search the name field with [“Brooklyn Academy”, “Music”] plus the other fields and still get a correct result, but if we want to create a structured database from the parses or hash the fields and do a simple lookup, this parse is rendered essentially useless.

The evaluation metric we use is full-parse accuracy, meaning the fraction of addresses where the model labels every single token correctly.
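
For completeness, the metric itself is a one-liner: a parse only counts if every token’s label is correct. The two hand-made parses below are just for illustration.

    def full_parse_accuracy(predicted, gold):
        # A parse only counts if every single token's label is correct.
        exact = sum(1 for p, g in zip(predicted, gold) if p == g)
        return exact / len(gold)

    gold = [["HOUSE", "HOUSE", "HOUSE", "HOUSE"], ["HOUSE_NUMBER", "ROAD", "ROAD"]]
    pred = [["HOUSE", "HOUSE", "ROAD", "HOUSE"], ["HOUSE_NUMBER", "ROAD", "ROAD"]]
    print(full_parse_accuracy(pred, gold))   # 0.5: one parse has a single wrong token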

On held-out data (addresses not seen during training), the libpostal address parser currently gets 98.9% of full parses correct. That’s a single model across all the languages, variations, and field combinations we have in the OSM training set.

Future improvements

The astute reader will notice that there’s still an open question here: how well does the synthesized training set approximate real geocoder input? While that’s difficult to measure directly, most of the decisions in constructing the training set thus far have been made by examining patterns in real-world addresses extracted from the Common Crawl, as well as user queries contributed to the project by a production geocoder.

There’s still room for improvement of course. Not every country is represented in the address formatting templates (though coverage continues to improve over time). Most notably, countries using the East Asian addressing system like China, Japan, and South Korea are difficult because the address format depends on which language/script is being used, necessitating some structural changes to the address-formatting repo. In OSM these addresses are not always split into components, possibly residing in the “addr:full” tag. However, since each language uses specific characters to delimit address components, it should be possible to parse the full addresses deterministically and use them as training examples.

The libpostal parser also doesn’t yet support apartment/flat numbers as they’re not included in most OSM addresses (or the address format templates for that matter). The parser typically labels them as part of the house number or street field. For geocoders, apartment numbers aren’t likely to turn up much as people tend to search at the level of the house/building number, but they may be unavoidable in batch geocoding. Supporting them would be relatively straightforward either by adding apartment or floor numbers to some of the training examples at random (without regard to whether those apartments actually exist in a particular building or not), or by parsing the addr:flats key in OSM. The context phrases like “Apt.” or “Flat” can be randomly sampled from any language in libpostal with a “unit_types” dictionary.

Conclusions

I’m hoping that libpostal will be the backbone for many great geocoders and apps in years to come. With that in mind, it’s been designed to be:

  1. International/multilingual
  2. Technology and stack independent
  3. Based on open data sets and fully open source

International by design, not as an afterthought

Almost every geocoder bakes in various myopic assumptions e.g. that addresses are only in the US, English, Latin script, the Global North, the bourgeoisie, etc.

Fully embracing L10N/I18N (localization/internationalization) means that there is no excuse for excluding people based on the languages they speak or the countries in which they live. An extra degree of rigor is required in recognizing and eliminating our own cultural biases.

There are of course always constraints on time and attention, so libpostal prioritizes languages in a simple, hopefully democratic way. Languages are added in priority order by the number of world addresses they cover, approximated by OpenStreetMap.

Usable on any platform

Libpostal is written in C mostly for reasons of portability. Almost every conceivable programming language can call into C code. There are already libpostal bindings for Python, Go, Ruby, Java, and NodeJS, and it’s quite easy to write bindings for other languages.

Informed completely by open data

Libpostal makes use of several great open data sets to construct training examples for the address parser and language classifier:

  • OpenStreetMap is used extensively by libpostal to create millions of training examples of parsed addresses and language classifications.
  • GeoNames is used by the address parser as a place name and postal code gazetteer, and will also be used for geographic name disambiguation in an upcoming release.
  • Quattroshapes and Zetashapes polygons are used in various places to add additional administrative and local boundary names to the parser training set. Zetashapes neighborhood polygons were particularly useful since neighborhoods are simple points in OSM.

All of the preprocessing code is open source, so researchers wanting to build their own models on top of open geo data sets are welcome to pursue it from any avenue (the puns just keep getting better) they choose.

The beauty of using these living, open, collaboratively edited data sets is that the models in libpostal can be updated and improved as the data sets improve. It also provides a great incentive for users of the library to support and contribute to open data.

Fin

You made it! The only thing left to do, if you haven’t already, is check out libpostal on Github: https://github.com/openvenues/libpostal.

If you want to contribute and help improve libpostal, you don’t have to know C, or any programming language at all for that matter. For non-technical folks, the easiest way to contribute is to check out our language dictionaries, which are simple text files that contain all the abbreviations and phrases libpostal recognizes. They affect both normalization and the parser. Find any language you speak (or add a directory if it’s not listed) and edit away. Your work will automatically be incorporated into the next build.

Libpostal is already scheduled to be incorporated into at least 3 geocoding applications written in as many languages. If you’re using it or considering it for your project/company, let us know.

Happy geocoding!

February 25, 2016 github hosting

How to Use Github for Hosting Files

Source

Learn how to use Github as a free file hosting service. You can upload images, PDFs, document or files of any other form into your Github from the browser.

Github, in simple English, is a website for hosting source code. The site is built for programmers and, if you are not one, it is highly unlikely that you have ever used Github. Repositories and Forks, the basic building blocks of Github, may seem like second nature to developers but, for everyone else, Github continues to be a complicated beast.

Github isn’t just a place for developers though. The site can be used as a writing platform. It can host HTML websites. You can use Github to visually compare the content of two text files. The site’s Gist service can be used for anonymous publishing and as a tasklist. There are so many things to do on Github already, and now you can use it as a free file hosting service as well.

How to Host Files on Github

It takes a few easy steps to turn your Github account into a file repository. You can upload files from the browser and you can add collaborators so they can also upload files to a common repository (similar to shared folders in Google Drive). The files are public so anyone can download them with a direct link. The one limitation is that individual files cannot be larger than 25 MB each. There are no known bandwidth limits though.

Step 1: Go to github.com and sign up for a free account, if you don’t have one. Choose the free plan as that’s all we need for hosting our files.

Step 2: Click the “New Repository” button, or go to github.com/new, to create a new repository for hosting your files. You can think of a repository as a folder on your computer.

[Github for File Hosting]

Step 3: Give your repository a name and a description and click the Create button. Adding a description helps others discover your files on the web. You can have private repositories too, but that requires a monthly subscription.

Step 4: Your repository will initially be empty. Click the Import Code button on the next screen to initialize the repository.

[Import code into Github]

Step 5: Paste the URL _https://github.com/labnol/files.git_ into the repository field and click Begin Import to create your Github repository for hosting files.

Upload Files to Github

Your Github repository is now ready. Click the Upload Files button and begin uploading files. You can drag one or more files from the desktop and then click Commit Changes to publish the files on the web. Github will accept any file as long as the size is within the 25 MB limit.

Github has a built-in previewer for PDF, text and image files (including [animated GIFs][6]) so anyone can view them without downloading the actual file. Otherwise, there’s a simple URL hack to get the raw (downloadable) version of any file hosted on Github.

[Upload Files to Github]

Direct URLs for Github Files

After the file has been uploaded to Github, click the filename in the list and you’ll get the file’s URL in the browser’s address bar. Append ?raw=true to the URL and you get a downloadable / embeddable version.

For instance, if the file URL is github.com/labnol/files/hello.pdf, the direct link to the same file would be github.com/labnol/files/hello.pdf?raw=true. If the uploaded file is an image, you can even embed it in your website using the standard img tag.
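
If you’re generating these links in bulk, the rewrite is trivial to script. A small hypothetical helper (the example URL mirrors the T-Rex file mentioned below):

    def raw_github_url(file_url):
        # Append ?raw=true (or &raw=true if the URL already has a query string).
        if "raw=true" in file_url:
            return file_url
        separator = "&" if "?" in file_url else "?"
        return file_url + separator + "raw=true"

    print(raw_github_url("https://github.com/labnol/files/blob/master/trex.jpg"))
    # https://github.com/labnol/files/blob/master/trex.jpg?raw=true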

Here’s a sample [file repository][8] on Github. The T-Rex image is [here][9] and the direct link is [here][10]. You can go to the Repository settings and add one or more collaborators. They’ll get write access to your repository and can then add or delete files.

[6]: http://www.labnol.org/tag/gif/
[8]: https://github.com/labnol/files
[9]:
[10]: https://github.com/labnol/files/blob/master/trex.jpg?raw=true

February 10, 2016 dropbox

15 Things You Didn’t Know You Could Do with Dropbox

Source

Just when you thought Dropbox couldn’t get any better, it has.

Many interesting cloud storage services have come and gone, but Dropbox is probably the one that’s been here the longest. And now it has upped its game with a host of new features. Let’s explore some of them from 2015 as well as some old but lesser-known ones. What we’re saying is let’s discover more stuff that you didn’t know you could do in and with Dropbox.

1. Request Files from Anyone

Sharing files saved in your Dropbox has always been easy. Collecting files in Dropbox from people? Not so much. You had to rely on third-party services for quite a long time…until Dropbox introduced its own file request feature. The best thing about it is that you can gather files even from people who don’t have a Dropbox account. No reason to force them to sign up for one, is there?

To initiate a file request, first head straight to your Dropbox account and click on File Requests in the sidebar to go to the file requests page. See that big blue plus icon there? Click on it to create a file request.

file-requests-section

You’ll have to specify a catchall name for the files that you want to collect. Dropbox creates a new folder with this name to direct the incoming files to. You can also use an existing folder instead.

create-file-request

For every file request that you create, you’ll get a unique link to share with the people you want to receive files from. Ensure that you have enough space in your Dropbox account for the incoming files. Otherwise, the person sending the files will encounter an error message.

Don’t worry about the privacy settings for the received files. Only you can see them, and later share them if and when you want to.

I used the @Dropbox File Request feature this morning, and it worked perfectly. Consider me impressed!

— Devon Michael Dundee (@devondundee) January 14, 2016

If you’re on the receiving end of a file request, you’ll get an email with a link to upload the requested files. Click on it and Dropbox will walk you through the straightforward upload process. You’ll have to limit the file size to 2 GB if you’re sending it to a Dropbox Basic user and to 10 GB if you’re sending it to a Pro or Business user.

We also recommend giving Balloon a try, if you don’t mind ditching the built-in file request feature in favor of a third-party app.

2. Preview Photoshop and Illustrator Files

Has someone shared a PSD file or an AI file with you on Dropbox? You don’t need access to the right Adobe software to preview it. You can do that right from Dropbox’s web interface, thanks to the interactive file preview feature introduced mid-2015.

Click on the file you want to preview and you’ll get an image toolbar that you can use to zoom in on any portion of the preview.

Coolest surprise of the day? Being able to preview an @Illustrator file in @Dropbox on #iOS. Geeeeenius!!

— Sophie Exintaris (@eurydice13) December 3, 2015

You can preview files not only in PSD and AI formats, but also in PNG, JPG, EPS, SVG, and BMP. But the previews for certain formats like PSD, AI, and SVG will be sharper and clearer than for the rest. The file preview feature also allows you to preview PDFs, slideshows, videos, and more.

pdf-preview

If you’re a creative professional, the preview feature ensures that you don’t have to worry about compressing high-resolution files or converting them to other, more easily viewable formats for sharing with clients. Share a Dropbox link to the design file and be done with it. Your client can preview the file (in full resolution!) and leave feedback on it from Dropbox on the web.

3. Rejoin Shared Folders

Let’s say you left a shared folder, accidentally or otherwise, by deleting it from your Dropbox, and now you want back in. Regaining access to that folder is as simple as clicking on Sharing in the sidebar and then clicking on the Rejoin link next to the folder you want fresh access to.

rejoin-shared-folder

Remember, deleting files inside the shared folder works differently from deleting the shared folder itself. The former will make the files disappear from everybody else’s Dropbox account as well, but then again, anyone with access to the shared folder can restore them.

4. Find Files Faster with Dropbox Recents

You don’t have to dig through folder after folder to find a Dropbox file that you just edited. You’ll find a link to it under Recents in the sidebar. This section keeps an updated list of files that you have opened or modified recently. Share, download, comment, delete, or even view previous versions of the file straight from this list.

dropbox-recents

5. Work as a Team

Many Dropbox users — solopreneurs, for example — use the Basic and Pro versions of Dropbox for business. If you’re one of those users, congratulations. You can now collaborate better on projects using the new Team feature.

After you create a team, you’ll be able to add members to it, share files and folders with them, and create sub-folders for better organization. As the team administrator you get granular control over file and folder permissions. Also, you’re sure to appreciate the ability to link your work and personal Dropbox accounts and switch between them easily without having to log out of either.

Having 2 different Dropbox accounts in one for Personal/Work is awesome. Awesome new Team feature @Dropbox!

— Maarten Busstra (@qtbox) October 28, 2015

Your work projects are not the only ones that can benefit from this collaboration feature. Personal projects also can. Have a family vacation coming up? Or a wedding? Or a friend’s birthday? Create a Dropbox team and get started on the planning!

6. Discuss Files You’re Viewing

You have probably noticed that Dropbox files on the web now come with a commenting mechanism. If you haven’t, shift your attention to the right sidebar when you have a file or file preview open, and there it is.

As is standard procedure on the web these days, you can @mention someone to get their attention, and in this case, to get their inputs on the file. They’ll receive an email notification about it and can leave a comment on the file even if they aren’t a Dropbox user.

comments-section

The added advantage is that if it’s a Microsoft Office file that you’re discussing, you can edit it right there based on the feedback, thanks to the Dropbox-Office Online integration. Your edits will automatically get saved back to Dropbox.

7. Sync Files Faster

By default, Dropbox limits the bandwidth allocated to the files being uploaded to your account. If you want to take advantage of your network’s higher capacity, you can remove this limit altogether or set a custom one from Dropbox’s settings.

To remove bandwidth limits for file uploads on a Mac, first open Preferences from Dropbox’s menu bar icon.

Next, switch to the Network tab and click on the Change Settings button next to Bandwidth. Now select the radio button next to Don’t limit, or, if you want to specify a limit, select the radio button next to Limit to and type in an upload speed. You can also limit the download rate from the same section. Hit the Update button once you have made the changes.

dropbox-upload-limit

To access the bandwidth settings on Windows 7 and above, click on the Dropbox icon in the system tray and go to Preferences > Bandwidth.

8. Instantly Delete Sensitive Files for Good…

Files that you delete from your Dropbox don’t disappear immediately from your computer or your Dropbox account. They get queued up for permanent deletion and stay part of the Dropbox ecosystem for at least 30 days. The deleted files also stay in the cache folder (.dropbox.cache) within Dropbox’s root folder on your computer for three days.

Note: If you have a Pro account with Extended Version History, the deleted files stay in the online deletion queue for up to one year.

My #Dropbox is acting weird. Even though I delete a folder, it keeps appearing again. :/

— Arun Sathiya (@iarunsb) January 1, 2016

If the files you deleted contain sensitive data, you might want to clear them out from the deletion queue manually. To do so, go to the home page of your account and click on the trash icon to the left of the search box. This displays the deleted files and they appear grayed out.

Now select a binned file that you want to erase permanently and click on the Permanently delete… option in the menu bar at the top. Do this for each file that you want to erase right away. Of course, you can select multiple files using Ctrl on Windows or cmd on a Mac.

dropbox-permanently-delete-files

Here comes another important step: getting rid of the deleted files from Dropbox’s cache folder. You can’t see this folder unless your system is set to show hidden files. You’ll need to access it and once again delete the files from there to get rid of them for good. Of course, if you do nothing, Dropbox will still clear the cache folder in three days’ time.

Based on whether you’re using Windows, Mac, or Linux, you’ll have to look up Dropbox’s instructions to reveal the cache folder on your computer.

Warning: You can’t recover any of the files you have deleted using the steps above, but someone with access to your computer and good recovery software might be able to.

  1. Be 100% sure that you want to delete a file before you delete it.
  2. Look for a more advanced security solution to remove even the most deeply hidden remnants of deleted files.

9. Add a 4-Digit PIN to the Dropbox App on Your Mobile

You know all about protecting your Dropbox account with two-factor authentication and you have set it up already, right? Have you also secured the Dropbox app on your phone or tablet with a PIN or passcode? The passcode feature is not new, but it’s one that many people overlook.

Set a passcode for the Dropbox app now via Dropbox settings > Advanced Features > Configure Passcode on your Android device or via Dropbox settings > Passcode Lock on your iPhone. For iPads and Windows tablets, here are the instructions to set a passcode.

dropbox-passcode-android

Are you a Pro user? Then in addition to setting a passcode, you can enable the setting to remotely erase all Dropbox data on that device after 10 failed attempts at entering the correct PIN. This can prove helpful if your phone ever falls into the wrong hands. There’s a catch though. You can proceed with the remote data wipe only if the device is online.

Also, if you’re a Basic user, you have to content yourself with unlinking the lost device by clicking on the x icon next to its name under Dropbox Settings > Security > Devices.

10. Carry Your Bookmarks Everywhere

Dropbox being such a great way to sync anything, we have all come up with various makeshift ways to sync bookmarks to the cloud. But we don’t need them anymore, because Dropbox has now added a feature to do just that.

You can now drag and drop links to Dropbox on the web or on your computer. They get backed up just like your files do, so you can open them from any location.

dropbox-bookmark

Unfortunately, clicking on a bookmark from Dropbox’s web interface loads a preview page for the bookmark instead of the link suggested by the bookmark. That’s why we recommend using the bookmark’s context menu to open the link in a new tab.

You’ll really appreciate the convenience of this bookmarking feature when you’re collaborating with someone on a project and have a bunch of shared links to keep track of.

11. Host a Podcast from Dropbox with JustCast

We recently shared an exhaustive guide on how to start a successful podcast. If you’re gearing up to start a podcast yourself and are on the lookout for a decent, easy-to-manage podcast host, your search ends here — with JustCast, which is ridiculously simple to use.

Once you connect JustCast to your Dropbox, a folder named JustCast will appear in /Dropbox/Apps. Any mp3 file you add to Dropbox/Apps/JustCast/podcast_name will automatically go into your podcast’s RSS feed. All you have to do is tell people to subscribe to the feed. Use the built-in metrics feature to track the subscriber and download count.

justcast-workflow

To publish the podcast on iTunes, visit this link for podcast submission and paste the link to your RSS feed there to proceed.

Now let’s talk money. You don’t have to shell out any if you’re content having just three of the most recent episodes showing up in the feed. For unlimited feed items, you have the Pro plan at $5/month.

Here’s something you should make a note of. Dropbox has some restrictions in place on file hosting and sharing. So once your podcast gathers momentum and your audience grows, you’ll need to consider upgrading your Dropbox account to keep up with the increasing number of file downloads.

@badbeef I use JustCast. It takes a dropbox folder and turns it into a Podcast source with little setup. https://t.co/ych9zAbbxn #heynow

— Bt (@mingistech) November 13, 2015

Even if starting a podcast is not in your plans, you can put JustCast to good use by turning it into a personal podcast playlist. Put any MP3 audio files you want to listen to into Dropbox as described above and use the RSS feed in your podcast client — just as you would with any other podcast.

Be mindful of copyright restrictions for any files you’re uploading to Dropbox.

12. Theme Your Dropbox with Orangedox

If you use Dropbox for work, you might want to tweak its interface to align with your brand. And that’s where Orangedox steps in. It gives you tools to add special touches to the Dropbox portal, such as your own logo and color scheme.

Orangedox also allows you to track the documents you have shared and get download stats for them. Note that this tracking feature is the only one available in the Free Forever plan.

I’m in love with Orangedox! Let’s me track downloads from Dropbox folders…free! http://t.co/1yHN5vMxEC

— Shana Festa (@BookieMonsterSF) October 1, 2014

We must admit that Orangedox hasn’t quite picked up steam despite launching back in 2014, more than a year ago. But considering there seem to be no other apps that let you theme Dropbox, Orangedox is still worth a shot.

13. Create Photo Galleries Using Dropbox Photos with Photoshoot

Okay. We admit that we’re cheating a bit here. You already know of apps that turn your Dropbox photos into galleries. But we had to include Photoshoot in this list because it makes the process so easy.

You drag and drop photos into Dropbox and Photoshoot takes care of creating the gallery, complete with items like thumbnails, titles, dates, and a lightbox display. You can leave the gallery visible to the public or hide it behind a password.

sample-photoshoot-gallery

Professional photographers will get the most out of Photoshoot. If you are one, you’ll be happy to know that the app gives you options to use a custom domain, add your logo, theme the gallery with your brand’s colors, etc. You can even add links to your social networks.

The verdict is that if you’re looking for a hassle-free and elegant way to show off your best work, you’ll fall in love with Photoshoot. Check out a sample gallery here.

14. Skip File Display and Go Straight to File Download

When you click on a Dropbox link you have received, your browser displays the file and gives you an option to download it. But you can force your browser to start downloading the file immediately instead of displaying it first. To do so, you’ll have to change the dl=0 query parameter in the shared link to dl=1.

Let’s say the Dropbox link reads www.dropbox.com/…/URL.webloc?dl=0. Copy-paste it in your browser, change the dl=0 bit at the end of the link text to dl=1 (www.dropbox.com/…/URL.webloc?dl=1) and then hit Enter. Your browser will begin downloading the file right away.
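
Spelled out with a made-up shared link (yours will have its own path and file name), the change looks like this:

    Shows a preview page:    https://www.dropbox.com/s/abc123/vacation-photos.zip?dl=0
    Downloads immediately:   https://www.dropbox.com/s/abc123/vacation-photos.zip?dl=1

The only difference is the value of the dl parameter at the very end of the link.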

TIL can load files from Dropbox in Safari/iOS pic.twitter.com/ZXJCGiWSEU

— Ricardo Cabello (@mrdoob) October 29, 2015

15. Put Dropbox in a Menu Bar Panel with App Box for Dropbox [Mac]

Want quick access to your Dropbox folders without having to switch to a new Finder window on OS X? The lightweight App Box for Dropbox can help you with that. For $0.99 it places your Dropbox inside a panel that you can display with a single click from the menu bar. Sounds basic? It is. Sounds useful? It’s that too. We wish Windows also had something similar to put the whole of Dropbox in a pop-up panel accessible from the system tray.

Note that there are other similarly named versions of this app in the Mac App Store with similar functionality. It’s not clear if they come from the same developer though. One of the versions is even free. Do your research before you install the app.

What’s in Store for Dropbox in 2016?

From Dropbox tools for the power user to Dropbox etiquette to time-saving Dropbox shortcuts, we poured everything we knew about Dropbox into article after article. And we thought we had covered it all. We were wrong. As you can see, Dropbox is keeping us on our toes and giving us fodder for more articles. We hope it keeps up this pace in future. Happy “Dropboxing”!

Have you been using some of the new features introduced by Dropbox in 2015? Which Dropbox tricks or apps have you come across lately? Give us your best Dropbox tips in the comments.

February 1, 2016 google gmail

How to Find Out the Exact Date You Created Your Gmail Account

Source

Knowing the date you created your Gmail account can come in really handy if you ever have to go through Gmail’s account recovery process. Here’s how to find that exact date.

Search your inbox for Gmail’s welcome email and check its timestamp to see when you created your Gmail account. If you’ve deleted that email, like many of us have, try this other method instead.

Click on the gear icon below your profile picture and go to Settings > Forwarding and POP/IMAP. Under POP Download, look for Status. If you haven’t tampered with this section at all, you’ll see the message “POP is enabled for all mail that has arrived since”, followed by the date you created your Gmail account. Note it down in a safe place.

gmail-creation-date

If you have fiddled with the POP settings before, you’ll most likely see a blank space instead of the account creation date. This means that you’re out of options for now at least.

There was a third method floating around on the web involving the creation of a Google Takeout archive, but that no longer seems to work. How we wish it did!

Ah! I just found my Gmail account creation date: 7/26/2004. Quite a while ago!

— Andrew @ PAX (@coulombe) May 30, 2013

Have you ever lost your Gmail account? How did the account recovery process go? Do you know of any other method to recover the account creation date for Gmail?

Image Credit: Save the Date on a wood cube by Gustavo Frazao via Shutterstock

February 1, 2016 github hosting

A Guide to Creating and Hosting a Personal Website on GitHub | Jonathan McGlone

Source

A step-by-step beginner’s guide to creating a personal website and blog using Jekyll and hosting it for free using GitHub Pages.

View Demo Site   Download Demo Files

This guide is meant to help Git and GitHub beginners get up and running with GitHub Pages and Jekyll in an afternoon. It assumes you know very little about version control, Git, and GitHub. It is helpful if you know the basics of HTML and CSS since we’ll be working directly with these languages. We’ll also be using a little bit of Markdown, but by no means do you need to be an expert with any of these languages. The idea is to learn by doing, so the code we’ll be implementing in this tutorial is available in this guide or can be downloaded entirely at this GitHub repo. Feel free to copy and paste or type this code directly into your project’s files.

For a little background on why I chose GitHub and GitHub Pages for my personal website (and other projects), see this note.

Other Resources You Should Know

In order to make GitHub Pages accessible to a wider audience, this guide focuses on using the web interface on github.com to build your personal website, thereby generalizing and glossing over the standard tools associated with Git and GitHub. To get a lot dirtier with Git and GitHub (i.e., the command line and terminal), there are several other great guides you should know about, probably bookmark, and read after completing this one, or jump over to right away if that is more your speed: Anna Debenham, Thinkful, and even GitHub itself go above and beyond in making the command line and local workflow of GitHub hosting and Jekyll templates accessible to a wider audience.

Also, at the end of this document, there is a pretty good list of resources related to Git, GitHub/Pages, Jekyll, and Markdown that can help you dive deeper into these tools. I’ll do my best to keep this list updated as I find new ones.

What Are Git, GitHub, and GitHub Pages?

Git, GitHub, and GitHub Pages are all very closely related. Imagine Git as the workflow to get things done and GitHub and GitHub Pages as places to store the work you finish. Projects that use Git are stored publicly in GitHub and GitHub Pages, so in a very generalized way, Git is what you do locally on your own computer and GitHub is the place where all this gets stored publicly on a server.

Git

Git is a version control system that tracks changes to files in a project over time. It typically records what the changes were (what was added? what was removed from the file?), who made the changes, notes and comments about the changes by the changer, and at what time the changes were made. It is primarily used for software development projects which are often collaborative, so in this sense, it is a tool to help enable and improve collaboration. However, its collaborative nature has led it to gain interest in the publishing community as a tool to help in both authoring and editorial workflows.

Git is for people who want to maintain multiple versions of their files in an efficient manner and travel back in time to visit different versions without juggling numerous files along with their confusing names stored at different locations. Think of Git and version control like a magic undo button.

In the diagram below, each stage represents a “save”. Without Git, you cannot go back to any of the in-between stages between the initial draft and the final draft. If you wanted to change the opening paragraph in the final draft, you’d have to delete data that you couldn’t recover. To work around this, we use the “save as” option, name the file something different, delete the opening paragraph and start writing a new one.

With Git, the flow is multidirectional. Each significant change is marked as a version, and you proceed. If you need to get back to earlier stages, you can do so without any loss of data. Presently, Google Docs’ “revision history” or Wikipedia’s “edit history” work in this sort of fashion. Git is just a lot more detailed and can get a lot more complex if needed.[1]

When you have the chance, I highly recommend this 15-minute, hands-on web tutorial on using Git.

GitHub

GitHub is a web hosting service for the source code of software and web development projects (or other text-based projects) that use Git. In many cases, most of the code is publicly available, enabling developers to easily investigate, collaborate, download, use, improve, and remix that code. The container for the code of a specific project is called a repository.

There are thousands of really cool and exciting repositories on GitHub, with new ones added every day. Some examples of popular software development projects that make their code available on GitHub include:

  • Twitter Bootstrap, an extremely popular front-end framework for mobile first websites, created by developers at Twitter.
  • HTML5 Boilerplate, a front-end template for quickly building websites.
  • D3, a JavaScript library for data visualization.
  • Ruby on Rails, the open-source web framework built on Ruby.

Usually, people just host the files that contain their code, so what you end up seeing is the actual code, as in this example from the Ruby on Rails project:

GitHub Pages

GitHub Pages are public webpages hosted for free through GitHub. GitHub users can create and host both personal websites (one allowed per user) and websites related to specific GitHub projects. Pages lets you do the same things as GitHub, but if the repository is named a certain way and files inside it are HTML or Markdown, you can view the file like any other website. GitHub Pages is the self-aware version of GitHub. Pages also comes with a powerful static site generator called Jekyll, which we’ll learn more about later on.

Getting Started with GitHub Pages

Don’t worry if some of these concepts are still a little fuzzy to you. The best way to learn this stuff is to just start doing the work, so let’s not waste any more time and dive right in.

1 Create your project’s repository. Log in to your GitHub account and go to https://github.com/new or click the New repository icon from your account homepage.

2 Name your repository username.github.io, replacing username with your GitHub username. Be sure it is public and go ahead and tell GitHub to create a README file upon generating the repo.

3 Create an index.html page by clicking the plus icon next to your repository name and typing the file name directly in the input box that appears.

On the resulting page, put this markup inside of the GitHub text editor:

<!DOCTYPE html>
<html>
    <head>
        <title>Hank Quinlan, Horrible Cop</title>
    </head>
    <body>
        <nav>
            <ul>
                <li><a href="/">Home</a></li>
                <li><a href="/about">About</a></li>
                <li><a href="/cv">CV</a></li>
                <li><a href="/blog">Blog</a></li>
            </ul>
        </nav>
        <div class="container">
            <div class="blurb">
                <h1>Hi there, I'm Hank Quinlan!</h1>
                <p>I'm best known as the horrible cop from <em>A Touch of Evil</em>. Don't trust me. <a href="/about">Read more about my life...</a></p>
            </div><!-- /.blurb -->
        </div><!-- /.container -->
        <footer>
            <ul>
                <li><a href="mailto:hankquinlanhub@gmail.com">email</a></li>
                <li><a href="https://github.com/hankquinlan">github.com/hankquinlan</a></li>
            </ul>
        </footer>
    </body>
</html>

4 Commit index.html. At the bottom of the page, there is a text input area to add a description of your changes and a button to commit the file.

Congrats! You just built your first GitHub Pages site. View it at http://username.github.io. Usually the first time your GitHub Pages site is created it takes 5-10 minutes to go live, so while we wait for that to happen, let’s style your otherwise plain HTML site.

5 To style the content go back to your repository home and create a new file named css/main.css. The css/ before the filename will automatically create a subdirectory called css. Pretty neat.

Place the following inside main.css:

body {
    margin: 60px auto;
    width: 70%;
}
nav ul, footer ul {
    font-family:'Helvetica', 'Arial', 'Sans-Serif';
    padding: 0px;
    list-style: none;
    font-weight: bold;
}
nav ul li, footer ul li {
    display: inline;
    margin-right: 20px;
}
a {
    text-decoration: none;
    color: #999;
}
a:hover {
    text-decoration: underline;
}
h1 {
    font-size: 3em;
    font-family:'Helvetica', 'Arial', 'Sans-Serif';
}
p {
    font-size: 1.5em;
    line-height: 1.4em;
    color: #333;
}
footer {
    border-top: 1px solid #d5d5d5;
    font-size: .8em;
}

ul.posts {
    margin: 20px auto 40px;
    font-size: 1.5em;
}

ul.posts li {
    list-style: none;
}

Don’t forget to commit the new CSS file!

6 Link to your CSS file inside your HTML document’s <head>. Go back to index.html and select the “Edit” button.

Add a link to main.css inside the <head> (the new markup is the stylesheet comment and <link> tag):

<!DOCTYPE html>
<html>
    <head>
        <title>Hank Quinlan, Horrible Cop</title>
        <!-- link to main stylesheet -->
        <link rel="stylesheet" type="text/css" href="/css/main.css">
    </head>
    <body>
        <nav>
            <ul>
                <li><a href="/">Home</a></li>
                <li><a href="/about">About</a></li>
                <li><a href="/cv">CV</a></li>
                <li><a href="/blog">Blog</a></li>
            </ul>
        </nav>
        <div class="container">
            <div class="blurb">
                <h1>Hi there, I'm Hank Quinlan!</h1>
                <p>I'm best known as the horrible cop from <em>A Touch of Evil</em>. Don't trust me. <a href="/about">Read more about my life...</a></p>
            </div><!-- /.blurb -->
        </div><!-- /.container -->
        <footer>
            <ul>
                <li><a href="mailto:hankquinlanhub@gmail.com">email</a></li>
                <li><a href="https://github.com/hankquinlan">github.com/hankquinlan</a></li>
            </ul>
        </footer>
    </body>
</html>

Visit http://username.github.io to see your styled website. It should look something like http://hankquinlan.github.io.

Using Jekyll with GitHub Pages

Like GitHub Pages, Jekyll is self-aware, so if you add folders and files following specific naming conventions, when you commit to GitHub, Jekyll will magically build your website.

While I recommend setting up Jekyll on your own computer so you can edit and preview your site locally, and when ready, push those changes to your GitHub repo, we’re not going to do that. Instead, to quickly get a handle on how Jekyll works, we’re going to build it into our GitHub repo using the GitHub web interface.

What is Jekyll?

Jekyll is a very powerful static site generator. In some senses, it is a throwback to the days of static HTML before databases were used to store website content. For simple sites without complex architectures, like a personal website, this is a huge plus. When used alongside GitHub, Jekyll will automatically re-generate all the HTML pages for your website each time you commit a file.

Jekyll makes managing your website easier because it depends on templates. Templates (or layouts in Jekyll nomenclature) are your best friend when using a static site generator. Instead of repeating the same navigation markup on every page I create, which I’d have to edit on every page if I add, remove, or change the location of a navigation item, I can create what Jekyll calls a layout that gets used on all my pages. In this tutorial, we’re going to create two Jekyll templates to help power your website.

Setting Up Jekyll on github.com

In order for Jekyll to work with your site, you need to follow Jekyll’s directory structure. To learn about this structure, we’re going to build it right into our GitHub repo.
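
For a rough picture of where we’re headed, here’s approximately what the finished repository will contain by the end of this guide (plus the about/ and cv/ pages you’ll add at the very end; Jekyll will also generate a _site directory on its own, which we’ll tell Git to ignore in Step 7):

    username.github.io/
        .gitignore
        _config.yml
        _layouts/
            default.html
            post.html
        _posts/
            2014-04-30-hank-quinlan-site-launched.md
        blog/
            index.html
            atom.xml
        css/
            main.css
        index.html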

7 Create a .gitignore file. This file tells Git to ignore the _site directory that Jekyll automatically generates each time you commit. Because this directory and all the files inside are written each time you commit, you do not want this directory under version control.

Add this simple line to the file:

    _site/

8 Create a _config.yml file that tells Jekyll some basics about your project. In this example, we’re telling Jekyll the name of our site and what version of Markdown we’d like to use:

    name: Hank Quinlan, Horrible Cop
    markdown: kramdown

At this point, I’m hopeful that you’ve got the hang of creating files and directories using the GitHub web interface, so I’m going to stop using screenshots to illustrate those actions.

9 Make a _layouts directory, and create a file inside it called default.html. (Remember, you can make directories while making new files. See the main.css step if you forgot.)

This is our main layout that will contain repeated elements like our <head> and <footer>. Now we won’t have to repeat that markup on every single page we create, making maintenance of our site much easier. So let’s move those elements from index.html into default.html to get something that looks like this in the end:

<!DOCTYPE html>
<html>
    <head>
        <title>{{ page.title }}</title>
        <!-- link to main stylesheet -->
        <link rel="stylesheet" type="text/css" href="/css/main.css">
    </head>
    <body>
        <nav>
            <ul>
                <li><a href="/">Home</a></li>
                <li><a href="/about">About</a></li>
                <li><a href="/cv">CV</a></li>
                <li><a href="/blog">Blog</a></li>
            </ul>
        </nav>
        <div class="container">

        {{ content }}

        </div><!-- /.container -->
        <footer>
            <ul>
                <li><a href="mailto:hankquinlanhub@gmail.com">email</a></li>
                <li><a href="https://github.com/hankquinlan">github.com/hankquinlan</a></li>
            </ul>
        </footer>
    </body>
</html>

Take note of the {{ page.title }} and {{ content }} tags in there. They’re what Jekyll calls liquid tags, and these are used to inject content into the final web page. More on this in a bit.

10 Now update your index.html to use your default layout:

---
layout: default
title: Hank Quinlan, Horrible Cop
---
<div class="blurb">
    <h1>Hi there, I'm Hank Quinlan!</h1>
    <p>I'm best known as the horrible cop from <em>A Touch of Evil</em>. Don't trust me. <a href="/about">Read more about my life...</a></p>
</div><!-- /.blurb -->

Notice the plain text at the top of the file. Jekyll calls this the Front-matter. Any file on your site that contains this will be processed by Jekyll. Every time you commit a file that specifies layout: default at the top, Jekyll will magically generate the full HTML document by replacing {{ content }} in _layouts/default.html with the contents of the committed file. Awesome!

Setting up a Blog

A Jekyll-based blog uses the same conventions that we’ve familiarized ourselves with in the previous steps, but takes things further by adding a few more for us to follow. Jekyll is very flexible, allowing you to extend your site as you wish, but in this guide we’re only going to cover the basics: creating a post, making a page to list our posts, creating a custom permalink for posts, and creating an RSS feed for the blog.

We’ll want to create a new layout for our blog posts called post.html and a folder to store each individual post called _posts/.

11 Start by creating the layout. Create a file named post.html in your _layouts folder. Notice the post layout uses the default layout as its base, and adds a couple of new liquid tags to print the title and date of the post:

---
layout: default
---
<h1>{{ page.title }}</h1>
<p class="meta">{{ page.date | date_to_string }}</p>

<div class="post">
  {{ content }}
</div>

12 Make a _posts/ directory where we’ll store our blog posts. Inside that folder will be our first post. Jekyll is very strict with how these files are named, so pay attention. It must follow the convention YYYY-MM-DD-title-of-my-post.md. This file name gets translated into the permalink for the blog post. So in this example, we’ll create a file named 2014-04-30-hank-quinlan-site-launched.md:

---
layout: post
title: "Hank Quinlan, Horrible Cop, Launches Site"
date: 2014-04-30
---

Well. Finally got around to putting this old website together. Neat thing about it - powered by [Jekyll](http://jekyllrb.com) and I can use Markdown to author my posts. It actually is a lot easier than I thought it was going to be.

Note the file extension .md stands for Markdown, and the Markdown syntax used inside the file gets converted to HTML by Jekyll. Like Wikitext, Markdown is a markup language with a syntax closer to plain text. The idea of Markdown is to get out of the author’s way so they can write their HTML content quickly, making Markdown very suitable as a blog authoring syntax. If you aren’t already, you’ll want to get familiar with Markdown syntax, and this printable cheatsheet (PDF) will be your best friend.
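
To give you a taste before you grab that cheatsheet, here are a few of the Markdown constructs you’ll reach for most often, with the HTML they turn into (the link uses a placeholder URL):

    # A big heading              becomes  <h1>A big heading</h1>
    **bold** and *italic*        becomes  <strong>bold</strong> and <em>italic</em>
    [a link](http://example.com) becomes  <a href="http://example.com">a link</a>
    - a bullet item              becomes  <li>a bullet item</li> inside a <ul>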

After committing the new post, navigate to http://username.github.io/YYYY/MM/DD/name-of-your-post to view it.

All this is great, but your readers won’t always know the exact URLs of your posts. So next we need to create a page on our site that lists each post’s title and hyperlink. You could create this list on your homepage or alternatively, create a blog subpage that collects all of your posts. We’re going to do the latter.

13 Create a blog directory and create a file named index.html inside it. To list each post, we’ll use a Liquid for loop to create an unordered list of our blog posts:

---
layout: default
title: Hank Quinlan's Blog
---
    <h1>{{ page.title }}</h1>
    <ul class="posts">

      {% for post in site.posts %}
        <li><span>{{ post.date | date_to_string }}</span> » <a href="{{ post.url }}" title="{{ post.title }}">{{ post.title }}</a></li>
      {% endfor %}
    </ul>

Now check out http://username.github.io/blog/. You should see your first post listed and linked there. Nice job!

Customizing Your Blog

We’ve only begun to scratch the surface of Jekyll’s blog-aware functionality. This guide is only going to cover a couple more steps you might want to take for your blog.

You may have noticed that the URL of your blog post does not include the blog directory in it. In Jekyll we can control the structure of our permalinks by editing the _config.yml file we created in Step 8. So let’s change our permalink structure to include the blog directory.

14 Edit the _config.yml file. Add the following line at the end of the file:

permalink: /blog/:year/:month/:day/:title

Now your blog posts will live at http://username.github.io/blog/YYYY/MM/DD/name-of-your-post.
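
For example, the post we created in Step 12 will now live at http://username.github.io/blog/2014/04/30/hank-quinlan-site-launched.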

It is also very easy to set up an RSS feed for your blog. Every time you publish a new post, it will get added to this RSS file.

15 Inside your blog/ directory, create a file and name it atom.xml. Add this to the file:

---
layout: feed
---
<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">

    <title>Hank Quinlan's Blog</title>
    <link href="http://hankquinlan.github.io/blog/atom.xml" rel="self"/>
    <link href="http://hankquinlan.github.io/blog"/>
    <updated>{{ site.time | date_to_xmlschema }}</updated>
    <id>http://hankquinlan.github.io/blog</id>
    <author>
        <name>Hank Quinlan</name>
        <email>hankquinlanhub@gmail.com</email>
    </author>

    {% for post in site.posts %}
        <entry>
            <title>{{ post.title }}</title>
            <link href="http://hankquinlan.github.io{{ post.url }}"/>
            <updated>{{ post.date | date_to_xmlschema }}</updated>
            <id>http://hankquinlan.github.io/{{ post.id }}</id>
            <content type="html">{{ post.content | xml_escape }}</content>
        </entry>
    {% endfor %}

</feed>

Now you can include a link to your RSS feed somewhere on your site for users to subscribe to your blog in their feed aggregator of choice. Navigate to http://username.github.io/blog/atom.xml to view your feed.

Note: In Chrome, your feed might look like an error, but it isn’t. Chrome doesn’t know how to display XML.

Wrapping Up

16 You’re almost done! Don’t forget to create and commit your about/index.html and cv/index.html pages. Since I’m sure you’ve got the hang of things now, I’ll back off and let you get these pages finished on your own.
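
If you’d like a starting point, here’s one way a bare-bones about/index.html could look, reusing the default layout exactly like the homepage does (the wording is just filler for you to replace with your own):

---
layout: default
title: About Hank Quinlan
---
<div class="blurb">
    <h1>About Hank Quinlan</h1>
    <p>I'm a police captain best known for my appearance in <em>A Touch of Evil</em>. Here's where I tell you a little more about my life and work.</p>
</div><!-- /.blurb -->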

17 Before going any further, take the time to set up Git and Jekyll on your own computer. This tutorial is all about Git in the web browser, so really it’s only the halfway point. You’re going to have to do this if you want to be able to upload image or PDF files to your GitHub repo. GitHub’s tutorials and desktop application make local setup easy, and now that you know many of Git and GitHub’s basic concepts, you should be able to get this going. Go do it!

Next Steps

Hopefully this guide has given you the confidence to do many other things with Git, GitHub, Jekyll, and your website or blog. You could go in many different directions at this point, as I’m sure you’ve already started thinking about, but here are a few other things I think would be worth your time:

Resources

I’ll try to keep this list current and up to date. If you know of a great resource you’d like to share or notice a broken link, please get in touch.

Git, GitHub, and GitHub Pages

Jekyll

Markdown


Notes

1. Somasundaram, R. (2013). Git: Version Control for Everyone (pp. 9-17). Birmingham, UK: Packt Publishing.

