Accessing the Internet in Go




Welcome to part 10 of the Go programming tutorial series, where we're going to be covering how to access the internet with Go and how to parse data from it. In our case, we're going to focus on getting data from a website's sitemap.

For this project, you can use any sitemap you want; you do not need to use my exact example. In my experience, pretty much every outside website I've used in a tutorial eventually changes, so this exact code may stop working at some point, for example because the sitemap's structure has changed. Be prepared to find another website's sitemap. Generally, you can find a website's sitemap by going to thewebsite.com/robots.txt. In that file, you will usually find a line that tells you where the sitemap is. Pretty much all major information-based websites have sitemaps, so feel free to use your favorite website rather than what I use.
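If you'd rather not eyeball robots.txt by hand, here is a rough sketch of pulling the sitemap lines out of it. The helper name sitemapURLs and the sample robots.txt text are made up for illustration; in practice you would hand it the text of a real site's robots.txt.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// sitemapURLs is a hypothetical helper: it scans robots.txt content and
// returns the value of every "Sitemap:" line it finds.
func sitemapURLs(robots string) []string {
	var urls []string
	scanner := bufio.NewScanner(strings.NewReader(robots))
	for scanner.Scan() {
		line := strings.TrimSpace(scanner.Text())
		// "sitemap:" is 8 characters; compare it case-insensitively.
		if len(line) >= 8 && strings.EqualFold(line[:8], "sitemap:") {
			urls = append(urls, strings.TrimSpace(line[8:]))
		}
	}
	return urls
}

func main() {
	// Made-up robots.txt content, standing in for a real download.
	sample := "User-agent: *\nDisallow: /private/\nSitemap: https://example.com/sitemap.xml\n"
	for _, u := range sitemapURLs(sample) {
		fmt.Println(u)
	}
}
```

A site can list more than one sitemap this way, which is why the helper returns a slice rather than a single string.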

For the sake of this tutorial, I am going to use The Washington Post's sitemap index, which is an XML document that links to categorized sitemaps which lead to actual articles. If you use your favorite website's sitemap, it might instead just link directly to content, rather than a sitemap > categorized sitemap > content.

To begin accessing data, let's consider the pipeline. To access data online, you first need to make a request. That request produces a response and, hopefully, some data. If you'd like to learn more about the Response, you can check out the Go docs for the Response that http.Get returns.

Okay, so: you make a request and you get a response. Within that response is the content itself, which is in byte form. To begin working with it, you will generally want to convert the bytes to a string. Finally, you'll want to close the response's Body to make sure we free up the resources. To print the data out, we'll use fmt; to make the request, we'll use net/http; and to read the data, we'll need to import io/ioutil:

package main

import (
	"fmt"
	"io/ioutil"
	"net/http"
)

Now, let's begin to make our request. To do that, we can use http.Get, which returns the response and any error we might receive. In Go, we can't declare local variables that we won't use, which can be unfortunate with function returns. I think the rule makes a lot of sense for most variable declarations, but it's odd when applied to function return values. For example, to use this code without handling the error, you need to do:

resp, _ := http.Get("https://www.washingtonpost.com/news-sitemap-index.xml")

If you see this code, you have no idea what that second return value is, and you need to go check the docs or use go doc just to figure out what the heck it is (go doc net/http Get, for example). I'd rather do

resp, err := http.Get("https://www.washingtonpost.com/news-sitemap-index.xml")

That only compiles if we go on to use err, typically by checking whether it is not nil, which you will probably be doing most of the time anyway, but I still think it's odd. Anyway, you don't care. Back to work. So we'll start with

resp, _ := http.Get("https://www.washingtonpost.com/news-sitemap-index.xml")
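If you do want to keep err around, a minimal sketch of the checked version might look like the following. The helper name tryGet is hypothetical, and the malformed URL in main is only there so the error branch runs without needing a network connection.

```go
package main

import (
	"fmt"
	"net/http"
)

// tryGet is a hypothetical helper: it makes the request and hands any
// error back to the caller instead of discarding it with _.
func tryGet(url string) error {
	resp, err := http.Get(url)
	if err != nil {
		// resp is nil when err is non-nil, so there is nothing to close.
		return err
	}
	// The request went through; close the body since we are done with it.
	resp.Body.Close()
	return nil
}

func main() {
	// A deliberately malformed URL, so http.Get fails before sending anything.
	if err := tryGet("://not-a-url"); err != nil {
		fmt.Println("request failed:", err)
		return
	}
	fmt.Println("request succeeded")
}
```

Because err is actually used in the if statement, the compiler no longer complains about an unused variable.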

Now that we have this response, we want to access the body within, which we can find at resp.Body. The body is a stream rather than a finished chunk of data, so we need to read it in with ioutil.ReadAll():

bytes, _ := ioutil.ReadAll(resp.Body)

Once we have this, we can simply convert the body data to a string with string():

string_body := string(bytes)

Now we could visualize this information with:

fmt.Println(string_body)

Don't forget to close the body to free up resources:

resp.Body.Close()

All together now:

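package main

import (
	"fmt"
	"io/ioutil"
	"net/http"
)

func main() {
	// Request the sitemap index; the error return is discarded here.
	resp, _ := http.Get("https://www.washingtonpost.com/news-sitemap-index.xml")
	// Read the whole response body into a byte slice.
	bytes, _ := ioutil.ReadAll(resp.Body)
	// Convert the bytes to a string and print it.
	string_body := string(bytes)
	fmt.Println(string_body)
	// Close the body to free up resources.
	resp.Body.Close()
}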

The output here should be a string of the source code of the page you put in http.Get(). You can check out the Response docs mentioned at the beginning of the tutorial to see what else is available. For example, if you wanted to get the response status, you could do: fmt.Println(resp.Status). This is useful since you might not get an error on the Get request, but your response status could still be a 404, or some other response code that you might want to handle. In our case, this is a 200, which is a general "the request worked" response status.

In our case, the printed sitemap index looks something like this (trimmed):

   <sitemap>
      <loc>http://www.washingtonpost.com/news-blogs-entertainment-sitemap.xml</loc>
   </sitemap>
   <sitemap>
      <loc>http://www.washingtonpost.com/news-blogs-goingoutguide-sitemap.xml</loc>
   </sitemap>
   <sitemap>
      <loc>http://www.washingtonpost.com/news-goingoutguide-sitemap.xml</loc>
   </sitemap>

So now we need to begin parsing through this, which is what we'll be covering in the next tutorial.
