AERONET data download
I use data from the AERONET network of sun photometers a lot in my work, and do a lot of processing of the data in Python. As part of this I usually want to load the data into pandas – but because of the format of the data, it’s not quite as simple as it could be.
So, for those of you who are impatient, here is some code that reads an AERONET data file into a pandas DataFrame which you can just download and use:
For those who want more details, read on…
Once you’ve downloaded an AERONET data file and unzipped it, you’ll find you have a file called something like 050101_161231_Chilbolton.lev20, and if you look at the start of the file it’ll look a bit like this:
Level 2.0. Quality Assured Data.
The following data are pre and post field calibrated, automatically cloud cleared and manually inspected.
Version 2 Direct Sun Algorithm
Location=Chilbolton,long=-1.437,lat=51.144,elev=88,Nmeas=13,PI=Iain_H._Woodhouse_and_Judith_Agnew_and_Judith_Jeffrey_and_Judith_Jeffery,Email=fsf@nerc.ac.uk_and__and__and_judith.jeffery@stfc.ac.uk
AOD Level 2.0,All Points,UNITS can be found at,,, http://aeronet.gsfc.nasa.gov/data_menu.html
Date(dd-mm-yy),Time(hh:mm:ss),Julian_Day,AOT_1640,AOT_1020,AOT_870,AOT_675,AOT_667,AOT_555,AOT_551,AOT_532,AOT_531,AOT_500,AOT_490,AOT_443,AOT_440,AOT_412,AOT_380,AOT_340,Water(cm),%TripletVar_1640,%TripletVar_1020,%TripletVar_870,%TripletVar_675,%TripletVar_667,%TripletVar_555,%TripletVar_551,%TripletVar_532,%TripletVar_531,%TripletVar_500,%TripletVar_490,%TripletVar_443,%TripletVar_440,%TripletVar_412,%TripletVar_380,%TripletVar_340,%WaterError,440-870Angstrom,380-500Angstrom,440-675Angstrom,500-870Angstrom,340-440Angstrom,440-675Angstrom(Polar),Last_Processing_Date(dd/mm/yyyy),Solar_Zenith_Angle
10:10:2005,12:38:46,283.526921,N/A,0.079535,0.090636,0.143492,N/A,N/A,N/A,N/A,N/A,0.246959,N/A,N/A,0.301443,N/A,0.373063,0.430350,2.115728,N/A,0.043632,0.049923,0.089966,N/A,N/A,N/A,N/A,N/A,0.116690,N/A,N/A,0.196419,N/A,0.181772,0.532137,N/A,1.776185,1.495202,1.757222,1.808187,1.368259,N/A,17/10/2006,58.758553
You can see here that we have a few lines of metadata at the top of the file, including the ‘level’ of the data (AERONET data is provided at three levels, 1.0, 1.5 and 2.0, referring to the quality assurance of the data), and some information about the AERONET site.
In this function we’re just going to ignore this metadata, and start reading at the 5th line, which contains the column headers. Now, you’ll see that the data looks like a fairly standard CSV file, so we should be able to read it fairly easily with pd.read_csv. This is true, and you can read it using:
df = pd.read_csv(filename, skiprows=4)
However, you’ll find a few issues with the DataFrame you get back from that simple line of code: firstly dates and times are just left as strings (rather than being parsed into proper datetime columns) and missing data is still shown as the string ‘N/A’. We can solve both of these:
No data: read_csv allows us to specify how ‘no data’ values are represented in the data, so all we need to do is set this: pd.read_csv(filename, skiprows=4, na_values=['N/A']). Note: we need to give na_values a list of values to treat as no data, hence we create a single-element list containing the string N/A.
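With the same kind of miniature, made-up file as before, na_values turns the N/A entries into proper NaNs and lets the column come back as numeric:

```python
import io

import pandas as pd

# Hypothetical miniature file in the AERONET layout (values made up)
sample = """Level 2.0. Quality Assured Data.
meta line
meta line
meta line
Date(dd-mm-yy),Time(hh:mm:ss),Julian_Day,AOT_1020
10:10:2005,12:38:46,283.527,0.0795
11:10:2005,09:15:02,284.385,N/A
"""

df = pd.read_csv(io.StringIO(sample), skiprows=4, na_values=['N/A'])
print(df['AOT_1020'].dtype)         # float64: the column is numeric now
print(df['AOT_1020'].isna().sum())  # 1 missing value
```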
Dates & times: These are a little harder, mainly because of the strange format in which they are provided in the file. Although the column header for the first column says Date(dd-mm-yy), the date is actually colon-separated (dd:mm:yy). This is a very unusual format for a date, so pandas won’t automatically convert it – we have to help it along a bit. So, first we define a function to parse a date from that strange format into a standard Python datetime:
dateparse = lambda x: pd.datetime.strptime(x, "%d:%m:%Y %H:%M:%S")
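Note that pd.datetime was just an alias for the standard library’s datetime class, and newer pandas versions have removed it; the equivalent with a plain datetime import behaves identically:

```python
from datetime import datetime

# Same parser as above, but using the datetime module directly,
# since newer pandas no longer provides the pd.datetime alias
dateparse = lambda x: datetime.strptime(x, "%d:%m:%Y %H:%M:%S")

print(dateparse("10:10:2005 12:38:46"))  # 2005-10-10 12:38:46
```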
I could have written this as a normal function (def dateparse(x)), but I used a lambda expression as it seemed easier for such a short function. Once we’ve defined this function we tell pandas to use it to parse dates (date_parser=dateparse) and also tell it that the first two columns together represent the time of each observation, and they should be parsed as dates (parse_dates={'times': [0, 1]}).
Putting all of this together, we get:
dateparse = lambda x: pd.datetime.strptime(x, "%d:%m:%Y %H:%M:%S")
aeronet = pd.read_csv(filename, skiprows=4, na_values=['N/A'],
                      parse_dates={'times': [0, 1]}, date_parser=dateparse)
That’s all we need to do to read in the data and convert the right columns, the rest of the function just does some cleaning up:
First we set the times as the index and delete the now-redundant Julian_Day column. Then we drop any rows and any columns which are entirely NaN (that’s what dropna(axis=1, how='all') does), rename the awkwardly-named Last_Processing_Date(dd/mm/yyyy) column, and sort by the index:

aeronet = aeronet.set_index('times')
del aeronet['Julian_Day']

# Drop any rows that are all NaN and any cols that are all NaN
# & then sort by the index
an = (aeronet.dropna(axis=1, how='all')
             .dropna(axis=0, how='all')
             .rename(columns={'Last_Processing_Date(dd/mm/yyyy)': 'Last_Processing_Date'})
             .sort_index())
You’ll notice that the last few bits of this ‘post-processing’ were done using ‘method chaining’, where we just ‘chain’ pandas methods one after another. This is often a very convenient way to work in Python.
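As a minimal illustration of the style (on a made-up DataFrame): each method returns a new DataFrame, so the steps read top to bottom as a single pipeline.

```python
import numpy as np
import pandas as pd

# A small made-up DataFrame with one all-NaN column
df = pd.DataFrame({'b': [3.0, np.nan, 1.0],
                   'empty': [np.nan, np.nan, np.nan]})

# Each call returns a DataFrame, so the clean-up reads as one pipeline
result = (df.dropna(axis=1, how='all')   # drops the all-NaN 'empty' column
            .dropna(axis=0, how='all')   # drops the row that is now all NaN
            .rename(columns={'b': 'value'})
            .sort_values('value'))

print(list(result.columns))      # ['value']
print(result['value'].tolist())  # [1.0, 3.0]
```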