Downloading Files from SharePoint via Graph Using Python
By Anatoly Mironov
Are you tired of manually downloading files from SharePoint? I’ve been on a quest to find an easy way to automate this process using Python and the Graph API. Despite the lack of a dedicated SDK, I’ve crafted a script that simplifies the task. While it’s not production-ready, it’s a solid starting point for anyone looking to streamline their workflow.
Why Python?
Well, simply because it is part of a Python notebook for data ingestion into AI Search. It downloads all the files I specify along with some metadata. Alternatives would be PnP PowerShell, or perhaps syncing or downloading an entire SharePoint folder, but you don’t always need all the documents. You could also download the files one by one, but why spend time on such a boring task when you can automate it?
The script
import requests
import urllib.parse

graph_base_url = 'https://graph.microsoft.com/v1.0'
sharepoint_domain = 'takana17.sharepoint.com'
directory = 'DOWNLOAD-PATH'  # local folder to save to, including the trailing slash
bearer_token = '<BEARER-TOKEN-HERE>'
doc_abs_url = 'FILE-LINK'  # the absolute url of the file in SharePoint

# Strip the query string and pick the site, the library and the file path out of the url
doc_abs_url_clean = doc_abs_url.split('?')[0]
stripped_url = doc_abs_url_clean.split('/sites/')[1]
site_slug_split = stripped_url.split('/', 2)
site_slug = site_slug_split[0]
drive_path = site_slug_split[1]
item_path = site_slug_split[2]
file_name = item_path.split('/')[-1]
file_name_unquoted = urllib.parse.unquote(file_name)

# 1. Get the site id from the site url
graph_url_site_id = f'{graph_base_url}/sites/{sharepoint_domain}:/sites/{site_slug}?select=id'
response = requests.get(graph_url_site_id, headers={'Authorization': f'Bearer {bearer_token}'})
site_id = response.json().get('id')
site_id_short = site_id.split(',')[1]

# 2. Get the drives (document libraries) of the site and find the right one
graph_url_drives = f'{graph_base_url}/sites/{site_id_short}/drives?select=id,webUrl'
response = requests.get(graph_url_drives, headers={'Authorization': f'Bearer {bearer_token}'})
drives = response.json().get('value')
drive_info = next((item for item in drives if item["webUrl"].endswith(drive_path)), None)
drive_id = drive_info.get('id')

# 3. Get the item - the file information - by its path within the drive
graph_url_item = f'{graph_base_url}/drives/{drive_id}/root:/{item_path}'
response_item = requests.get(graph_url_item, headers={'Authorization': f'Bearer {bearer_token}'})
item_download_link = response_item.json().get('@microsoft.graph.downloadUrl')

# 4. Download the file using the pre-authenticated download link
local_file_path = f'{directory}{file_name_unquoted}'
response_download = requests.get(item_download_link)
with open(local_file_path, "wb") as file:
    file.write(response_download.content)
Thoughts
- The script handles just one file; the next step is to put it in a loop (see the loop sketch after this list).
- For now it follows the “happy path”; error handling must be added.
- The script omits the creation of a bearer token, just to keep this post as simple as possible. I recommend creating a Service Principal, but to try it out you can copy and paste a bearer token from the Graph Explorer (a token sketch also follows this list).
- Why is there no endpoint in Graph for downloading a file based on the url?
- Is there really no python SDK for downloading files from SharePoint?
- Nomenclature: a site is a site, a drive is a document library, and an item is the metadata about a file.
- Chances are you will download files from the same site and the same “drive”, so the site_id and drive_id should be cached locally to avoid unnecessary calls to Graph and the risk of being throttled.
- Download links you get from Graph are pre-authenticated and short-lived.
- It’s best to put this script in a notebook.
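As a sketch of what that loop could look like, assuming graph_base_url, sharepoint_domain, directory and bearer_token are defined as in the script above; doc_urls and the cache dictionaries are my own illustrative names, not part of the original script. The site and drive ids are cached per site and library, and errors are caught per file so one broken link does not stop the whole run.

import urllib.parse
import requests

headers = {'Authorization': f'Bearer {bearer_token}'}
site_id_cache = {}   # site_slug -> site id
drive_id_cache = {}  # (site_slug, drive_path) -> drive id

doc_urls = ['FILE-LINK-1', 'FILE-LINK-2']  # the file links you want to download

for doc_abs_url in doc_urls:
    try:
        # Same url parsing as in the single-file script
        site_slug, drive_path, item_path = doc_abs_url.split('?')[0].split('/sites/')[1].split('/', 2)

        # Resolve the site id once per site
        if site_slug not in site_id_cache:
            r = requests.get(f'{graph_base_url}/sites/{sharepoint_domain}:/sites/{site_slug}?select=id', headers=headers)
            r.raise_for_status()
            site_id_cache[site_slug] = r.json()['id'].split(',')[1]

        # Resolve the drive (document library) id once per site and library
        cache_key = (site_slug, drive_path)
        if cache_key not in drive_id_cache:
            r = requests.get(f'{graph_base_url}/sites/{site_id_cache[site_slug]}/drives?select=id,webUrl', headers=headers)
            r.raise_for_status()
            drive = next(d for d in r.json()['value'] if d['webUrl'].endswith(drive_path))
            drive_id_cache[cache_key] = drive['id']

        # Get the item and download the file through the pre-authenticated link
        r = requests.get(f'{graph_base_url}/drives/{drive_id_cache[cache_key]}/root:/{item_path}', headers=headers)
        r.raise_for_status()
        download_url = r.json()['@microsoft.graph.downloadUrl']

        file_name = urllib.parse.unquote(item_path.split('/')[-1])
        with open(f'{directory}{file_name}', 'wb') as f:
            f.write(requests.get(download_url).content)
    except (requests.RequestException, KeyError, IndexError, StopIteration) as error:
        print(f'Failed to download {doc_abs_url}: {error}')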
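For the token itself, a minimal sketch with the msal package could look like the following, assuming an app registration (Service Principal) with an application permission such as Sites.Read.All already granted by an admin; the tenant id, client id and client secret are placeholders.

import msal

tenant_id = '<TENANT-ID>'
client_id = '<CLIENT-ID>'
client_secret = '<CLIENT-SECRET>'

# Client credentials flow: no user involved, the app authenticates as itself
app = msal.ConfidentialClientApplication(
    client_id,
    authority=f'https://login.microsoftonline.com/{tenant_id}',
    client_credential=client_secret,
)
result = app.acquire_token_for_client(scopes=['https://graph.microsoft.com/.default'])
bearer_token = result['access_token']  # raises KeyError if the token request failed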
Example
Consider this file url:
https://takana17.sharepoint.com/:w:/r/sites/spmt001/Shared%20Documents/ebooks2/languages/Chuvash/%D0%90%D0%BB%D0%B5%D0%BA%D1%81%D0%B0%D0%BD%D0%B4%D1%80%D0%BE%D0%B2%D0%B0%20%D0%9C.%D0%9A.%20%E2%80%93%20%D0%97%D0%B0%D0%BC%D0%B5%D1%82%D0%BA%D0%B8%20%D0%BE%20%D0%BA%D0%B0%D1%82%D0%B5%D0%B3%D0%BE%D1%80%D0%B8%D0%B8%20%D0%BF%D0%B0%D0%B4%D0%B5%D0%B6%D0%B0%20%D0%B2%20%D0%B3%D0%BE%D0%B2%D0%BE%D1%80%D0%B0%D1%85%20%D1%87%D1%83%D0%B2%D0%B0%D1%88%D1%81%D0%BA%D0%BE%D0%B3%D0%BE%20%D1%8F%D0%B7%D1%8B%D0%BA%D0%B0%20(2009).doc?d=wf4080cb1559a5aa29569b6c05d8aec51&csf=1&web=1&e=HLBWVR
First the script removes the “?” and everything after it, then it extracts the site url, the file path, and the file name in a readable form (%20 and so on, decoded with urllib.parse.unquote).
Then it gets the site id and the drive id, then the item - the information about the file - and finally it downloads the file using the download link.
It’s four requests that should be one.
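To make the parsing concrete, here is a small offline illustration of the same string operations applied to the example url above (the long encoded file name is abridged with “...” for readability):

import urllib.parse

# Example url with the query string already stripped and the encoded name abridged
doc_abs_url_clean = (
    'https://takana17.sharepoint.com/:w:/r/sites/spmt001/'
    'Shared%20Documents/ebooks2/languages/Chuvash/'
    '%D0%90%D0%BB%D0%B5%D0%BA%D1%81%D0%B0%D0%BD%D0%B4%D1%80%D0%BE%D0%B2%D0%B0'
    '%20%D0%9C.%D0%9A.%20...%20(2009).doc'
)

site_slug, drive_path, item_path = doc_abs_url_clean.split('/sites/')[1].split('/', 2)
print(site_slug)    # spmt001
print(drive_path)   # Shared%20Documents
print(item_path)    # ebooks2/languages/Chuvash/%D0%90...%20(2009).doc
print(urllib.parse.unquote(item_path.split('/')[-1]))  # Александрова М.К. ... (2009).doc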