
How is mime type of an uploaded file determined by browser?

I have a web app where the user needs to upload a .zip file. On the server-side, I am checking the mime type of the uploaded file, to make sure it is application/x-zip-compressed or application/zip.

This worked fine for me on Firefox and IE. However, when a coworker tested it, it failed for him on Firefox (sent mime type was something like "application/octet-stream") but worked on Internet Explorer. Our setups seem to be identical: IE8, FF 3.5.1 with all add-ons disabled, Windows XP SP3, WinRAR installed as native .zip file handler (not sure if that's relevant).

So my question is: How does the browser determine what mime type to send?

Please note: I know that the mime type is sent by the browser and, therefore, unreliable. I am just checking it as a convenience--mainly to give a more friendly error message than the ones you get by trying to open a non-zip file as a zip file, and to avoid loading the (presumably heavy) zip file libraries.

application/octet-stream designates a binary file. You should be able to get the extension of the file to see if it is a zip file. Just to clarify, did this work for you on FF, but not your co-worker?
yes, it worked for me in both browsers
take a look at the input/@formenctype or form/@enctype attributes

Community

Chrome

Chrome (version 38 as of writing) has 3 ways to determine the MIME type and does so in a certain order. The snippet below is from file src/net/base/mime_util.cc, method MimeUtil::GetMimeTypeFromExtensionHelper.

// We implement the same algorithm as Mozilla for mapping a file extension to
// a mime type.  That is, we first check a hard-coded list (that cannot be
// overridden), and then if not found there, we defer to the system registry.
// Finally, we scan a secondary hard-coded list to catch types that we can
// deduce but that we also want to allow the OS to override.

The hard-coded lists come a bit earlier in the file: https://cs.chromium.org/chromium/src/net/base/mime_util.cc?l=170 (kPrimaryMappings and kSecondaryMappings).

An example: when uploading a CSV file from a Windows system with Microsoft Excel installed, Chrome will report this as application/vnd.ms-excel. This is because .csv is not specified in the first hard-coded list, so the browser falls back to the system registry. HKEY_CLASSES_ROOT\.csv has a value named Content Type that is set to application/vnd.ms-excel.
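
The lookup order described above can be sketched roughly in Python. This is not Chrome's code: PRIMARY and SECONDARY hold a few illustrative entries rather than the real contents of kPrimaryMappings and kSecondaryMappings, os_registry is a plain dict standing in for the system registry, and guess_mime_type is a made-up name.

```python
# Illustrative entries only; not copied from Chromium's hard-coded lists.
PRIMARY = {"html": "text/html", "png": "image/png"}
SECONDARY = {"csv": "text/csv"}


def guess_mime_type(extension, os_registry):
    """Return a MIME type for the extension, or None if no source knows it."""
    ext = extension.lower().lstrip(".")
    # 1. Hard-coded list that cannot be overridden.
    if ext in PRIMARY:
        return PRIMARY[ext]
    # 2. Whatever the operating system reports (a dict stands in for it here).
    if ext in os_registry:
        return os_registry[ext]
    # 3. Secondary hard-coded list, consulted only when the OS has no answer.
    return SECONDARY.get(ext)


# Hypothetical machine where Excel has registered itself for .csv.
excel_machine = {"csv": "application/vnd.ms-excel", "html": "application/x-custom"}
print(guess_mime_type(".csv", excel_machine))
print(guess_mime_type(".csv", {}))
print(guess_mime_type(".html", excel_machine))
```

With these sample tables, the OS entry wins for .csv on the Excel machine, the secondary list only answers when the OS has nothing, and the OS cannot override the primary entry for .html.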

Internet Explorer

Again using the same example, the browser will report application/vnd.ms-excel. I think it's reasonable to assume Internet Explorer (version 11 as of writing) uses the registry. Possibly it also makes use of a hard-coded list like Chrome and Firefox, but its closed source nature makes it hard to verify.

Firefox

As indicated in the Chrome code, Firefox (version 32 as of writing) works in a similar way. The snippet below is from file uriloader\exthandler\nsExternalHelperAppService.cpp, method nsExternalHelperAppService::GetTypeFromExtension.

// OK. We want to try the following sources of mimetype information, in this order:
// 1. defaultMimeEntries array
// 2. User-set preferences (managed by the handler service)
// 3. OS-provided information
// 4. our "extras" array
// 5. Information from plugins
// 6. The "ext-to-type-mapping" category

The hard-coded lists come earlier in the file, somewhere near line 441. You're looking for defaultMimeEntries and extraMimeEntries.

With my current profile, the browser will report text/csv because there's an entry for it in mimeTypes.rdf (item 2 in the list above). With a fresh profile, which does not have this entry, the browser will report application/vnd.ms-excel (item 3 in the list).
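
The profile-dependent result can be mimicked with an ordered list of lookup tables. This is only a sketch: the table contents are hypothetical sample data, first_match and build_sources are invented helpers, and none of it mirrors the real nsExternalHelperAppService code.

```python
def first_match(extension, sources):
    """Return (source_name, mime_type) from the first source that knows the extension."""
    ext = extension.lower().lstrip(".")
    for name, table in sources:
        if ext in table:
            return name, table[ext]
    return None, None


def build_sources(user_prefs):
    # Same order as the Firefox comment; the last three sources are left empty here.
    return [
        ("defaultMimeEntries", {"html": "text/html"}),  # sample entry
        ("user preferences", user_prefs),
        ("OS", {"csv": "application/vnd.ms-excel"}),  # machine with Excel installed
        ("extras", {}),
        ("plugins", {}),
        ("ext-to-type-mapping", {}),
    ]


# A profile with a csv entry versus a fresh profile without one.
print(first_match("csv", build_sources({"csv": "text/csv"})))
print(first_match("csv", build_sources({})))
```

Reporting the name of the matching source alongside the type makes it easier to see why two seemingly identical machines can disagree.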

Summary

The hard-coded lists in the browsers are pretty limited. Often, the MIME type sent by the browser will be the one reported by the OS. And this is exactly why, as stated in the question, the MIME type reported by the browser is unreliable.


thanks! do you have a link to the hard-coded list in the chrome source?
@Kip yeah, I've added a link. Firefox doesn't seem to have an (official) online source code browser, I had to download it from their FTP server.
Having the MIME as ms-excel for CSV is annoying, wonder why it isn't in the hardcoded list.
It would be nice to know if there were some updates in mime-type detection since 2014.
@VitalyIsaev a cursory glance at the Chrome code shows that this hasn't changed since 2014.
Kumar

Kip, I spent some time reading RFCs, MSDN and MDN. Here is what I could understand. When a browser encounters a file for upload, it looks at the first buffer of data it receives and runs tests on it. These tests first determine whether the data matches any known MIME type and, if so, which one; the browser then acts accordingly. I think IE tries this first rather than just determining the file type from the extension. This page explains it for IE: http://msdn.microsoft.com/en-us/library/ms775147%28v=vs.85%29.aspx. For Firefox, what I could understand was that it reads file info from the filesystem or directory entry and determines the file type from that. Here is a link for FF: https://developer.mozilla.org/en/XPCOM_Interface_Reference/nsIFile. I would still like to have more authoritative info on this.
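
Whatever the browsers do, sniffing the first bytes is something the server can do for itself. Below is a minimal Python sketch, assuming the start of the upload is available as bytes. ZIP data normally begins with the two letters PK followed by two marker bytes. looks_like_zip is a hypothetical helper name.

```python
ZIP_SIGNATURES = (
    b"PK\x03\x04",  # local file header (ordinary archive)
    b"PK\x05\x06",  # end of central directory (empty archive)
    b"PK\x07\x08",  # spanned archive marker
)


def looks_like_zip(first_bytes):
    """Return True if the buffer starts with one of the ZIP signatures."""
    return first_bytes.startswith(ZIP_SIGNATURES)


print(looks_like_zip(b"PK\x03\x04" + b"\x00" * 26))
print(looks_like_zip(b"%PDF-1.4"))
```

Office documents and JAR files are ZIP containers too, so they pass this check as well; it only rules out files that are clearly not archives.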


Michael A. McCloskey

This is probably OS and possibly browser dependent, but on Windows, the MIME type for a given file extension can be found by looking in the registry under HKCR:

For example:

HKEY_CLASSES_ROOT\.zip - Content Type

To go from a MIME type to a file extension, you can look at the keys under

HKEY_CLASSES_ROOT\Mime\Database\Content Type

Each key there is named after a MIME type and records its default extension.
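
To check this from a script rather than regedit, Python's standard winreg module can read the same value. This is a sketch, assuming the value is named Content Type. registry_content_type is a hypothetical helper, and it simply returns None when not running on Windows.

```python
import sys


def registry_content_type(extension):
    """Return the Content Type value registered for an extension such as '.zip', or None."""
    if sys.platform != "win32":
        return None  # the registry only exists on Windows
    import winreg

    try:
        with winreg.OpenKey(winreg.HKEY_CLASSES_ROOT, extension) as key:
            value, _kind = winreg.QueryValueEx(key, "Content Type")
            return value
    except OSError:
        # Key or value missing: nothing is registered for this extension.
        return None


print(registry_content_type(".zip"))
```

The result depends entirely on what is installed on the machine, which is the point made above.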


thanks. unfortunately, for both me and my coworker this appears to be correct in our registry. i guess that's why it worked in IE for him, but FF is getting it differently somehow... oh well :(
johndodo

While this is not an answer to your question, it does solve the problem you are trying to solve. YMMV.

As you wrote, the MIME type is not reliable, as each browser has its own way of determining it. However, browsers send the original name (including extension) of the file. So the best way to deal with the problem is to inspect the extension of the file instead of the MIME type.

If you still need the MIME type, you can determine it server-side from your own Apache mime.types file.
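
A sketch of that idea in Python, for illustration. SAMPLE_MIME_TYPES is a made-up excerpt in the mime.types layout (type first, then its extensions), not Apache's real file, and parse_mime_types and type_from_filename are hypothetical helpers.

```python
import os

# Made-up excerpt in mime.types format: "<type> <ext> [<ext> ...]", '#' starts a comment.
SAMPLE_MIME_TYPES = """\
# MIME type            extensions
application/zip        zip
text/csv               csv
image/jpeg             jpeg jpg jpe
"""


def parse_mime_types(text):
    """Build an {'.ext': 'type/subtype'} mapping from mime.types-style text."""
    mapping = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        mime_type, *extensions = line.split()
        for ext in extensions:
            mapping["." + ext.lower()] = mime_type
    return mapping


def type_from_filename(filename, mapping):
    """Look up the type by the uploaded file's extension, ignoring case."""
    ext = os.path.splitext(filename)[1].lower()
    return mapping.get(ext)


mapping = parse_mime_types(SAMPLE_MIME_TYPES)
print(type_from_filename("Backup.ZIP", mapping))
```

This way the type comes from a table the server controls, so it is the same for every browser.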


Care to elaborate? In my experience browsers always send the correct original filename (with extension) while MIME types vary greatly. So yes, I would say it is far more reliable.
Correct. I meant to say that the end user can put any extension, regardless of actual type, so it should not be trusted.
That is true, but it doesn't matter if you use extension or MIME type - you should never trust user supplied input. But OP stated explicitly he is aware of this problem, so this is not part of this question. Btw, I would appreciate if you removed the downvote (I assume it came from you).
You're right, I did not pay attention to the note in the question, my bad. I can cancel my vote but you'll have to edit the answer for that (enforced by the system)...
Yeah, I agree with johndodo. As Stijn explained in his answer above, Chrome and Firefox check the extension first. They are doing the same thing in the end.
Seul Shahkee

I agree with johndodo: there are so many variables that the MIME types sent by browsers are unreliable. I would ignore the subtype that is received and just check the top-level type, such as 'application'. If your app is PHP-based, you can easily do this with the explode() function. In addition, check the file extension to make sure it is .zip or whatever other compression format you are looking for!
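
The same check sketched in Python rather than PHP, with str.split playing the role of explode(). ALLOWED_EXTENSIONS, top_level_type and plausible_archive are hypothetical names chosen for this example.

```python
import os

ALLOWED_EXTENSIONS = {".zip"}


def top_level_type(content_type):
    """'application/zip; charset=binary' -> 'application'."""
    return content_type.split(";", 1)[0].split("/", 1)[0].strip().lower()


def plausible_archive(content_type, filename):
    """Accept any 'application/*' type as long as the extension is allowed."""
    ext = os.path.splitext(filename)[1].lower()
    return top_level_type(content_type) == "application" and ext in ALLOWED_EXTENSIONS


print(plausible_archive("application/octet-stream", "data.zip"))
print(plausible_archive("text/plain", "data.zip"))
```

This accepts application/octet-stream as well, which is what the coworker's Firefox sent in the question.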


smwikipedia

According to RFC 1867 - Form-based File Upload in HTML:

Each part should be labelled with an appropriate content-type if the media type is known (e.g., inferred from the file extension or operating system typing information) or as application/octet-stream.

So my understanding is, application/octet-stream is kind of like a blanket catch-all identifier if the type cannot be inferred.


yes, i understand all of this. the question was how does the browser infer.
That's worth knowing though, right? If application/octet-stream is the catch-all, then another approach would be to trust the browser if it has been able to make a guess, and do your own server side tests if you get application/octet-stream .
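
A sketch of that approach in Python, assuming the whole upload is in memory as bytes. accept_upload is a hypothetical helper, ZIP_TYPES lists the two types named in the question, and zipfile.is_zipfile does the server-side test.

```python
import io
import zipfile

ZIP_TYPES = {"application/zip", "application/x-zip-compressed"}


def accept_upload(declared_type, data):
    """Trust a specific ZIP type from the browser; test the content on octet-stream."""
    declared = declared_type.split(";", 1)[0].strip().lower()
    if declared in ZIP_TYPES:
        return True  # the browser made a specific guess; take it as a convenience check
    if declared == "application/octet-stream":
        # The browser could not infer a type, so inspect the content ourselves.
        return zipfile.is_zipfile(io.BytesIO(data))
    return False


# Build a small archive in memory to try it out.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("hello.txt", "hi")
print(accept_upload("application/octet-stream", buffer.getvalue()))
```

As the question already notes, this is only a convenience check for friendlier error messages; opening the archive is still what decides whether the file is usable.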