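Technologies Adopted at the PHENIX Computing Center in Japan (PHENIX CC-J)
Shinya Sawada (KEK)
Takashi Ichihara, Yasushi Watanabe (RIKEN, RIKEN BNL Research Center)
Yuji Goto, Atsushi Taketani, Naoki Hayashi (RIKEN)
Hideto En'yo, Satoshi Yokkaichi (Kyoto Univ.), Hideki Hamagaki (CNS, Univ. of Tokyo)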
99.9.25
JPS mtg @ Matsue
1
PHENIX CC-J
Components of CC-J

Linux farm
Data server
HPSS
Network
Misc. software & tools
Linux farm

Two boxes of AltaCluster http://www.altatech.com/products/clusters.html
– 16 nodes = 32 CPUs (will be doubled soon)
– Pentium II 450 MHz (18.5 SPECint95/CPU)
– Remote boot, remote monitoring, …

Red Hat Linux 5.2, kernel 2.2.11 with NFSv3 patch
PBS Batch Queuing System
Memory: 512 MB/node
Local disk: 9-14 GB/node
– Benchmark test (Bonnie): write xxMB/s, read xxMB/s

NFS-mounted RAID5 disks on SUN E450
100BaseT NIC on each node & Catalyst 2948G (gigabit switching hub)
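
The aggregate CPU capacity of the farm follows directly from the per-CPU rating quoted above. The short sketch below only redoes that arithmetic; the function and variable names are illustrative and not part of any CC-J tool, and it assumes all nodes are identical dual-CPU machines.

```python
# Figures quoted on the slide; names are illustrative only.
NODES = 16
CPUS_PER_NODE = 2            # 16 nodes = 32 CPUs
SPECINT95_PER_CPU = 18.5     # Pentium II 450 MHz


def farm_capacity(nodes: int, cpus_per_node: int, specint_per_cpu: float) -> float:
    """Aggregate SPECint95 of a farm of identical nodes."""
    return nodes * cpus_per_node * specint_per_cpu


current = farm_capacity(NODES, CPUS_PER_NODE, SPECINT95_PER_CPU)
doubled = farm_capacity(2 * NODES, CPUS_PER_NODE, SPECINT95_PER_CPU)
print(f"current: {current:.0f} SPECint95, after doubling: {doubled:.0f} SPECint95")
```

The first figure is the 592 SPECint95 listed as the current CPU capacity in the summary.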
AltaCluster
Data Server

SUN E450: 2 CPUs at 400 MHz, 1 GB memory, 360 GB RAID disk (One more E450 will be purchased soon.)
– General 'home' machine

288 GB RAID5 disk (1.6 TB RAID5 will be purchased soon.)
– Working space for users

Alteon Ace 180 Gigabit Switch (jumbo frame operation)
RAID performance measurement


Preliminary measurement on 16 Apr 1999 (T. Ichihara, RIKEN)
Hardware: SUN E450 (Dual Ultra2 sparc, 400 MHz, 1280 MB memory)

Configuration            Size      Read                   Write
Bare disk (a)            200 MB    13.33 MB/s (15 s)      14.3 MB/s (14 s)
                         2000 MB   14.18 MB/s (141 s)     12.58 MB/s (159 s)
Hardware RAID4 (b)       200 MB    13.33 MB/s (15 s)      13.3 MB/s (15 s)
                         2000 MB   14.18 MB/s (141 s)     12.90 MB/s (155 s)
Hardware RAID5 (b)       200 MB    16.7 MB/s (12 s)       14.3 MB/s (14 s)
  working area for users 2000 MB   16.39 MB/s (122 s)     12.99 MB/s (154 s)
Software RAID 0+1 (c)    200 MB    25.00 MB/s (8 s)       11.1 MB/s (18 s)
  home area for users    2000 MB   23.26 MB/s (86 s)      10.10 MB/s (198 s)
Software RAID 5 (c)      200 MB    28.57 MB/s (7 s)       1.9 MB/s (102 s)
                         2000 MB   27.78 MB/s (72 s)      1.80 MB/s (1111 s)

(a) Seagate 18GB ST318275LC, internal, Ultra Wide SCSI
(b) 288GB, IAI SNX960000ED-3200, cache memory 32MB
(c) Solaris 2.6 DiskSuite
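
Each rate in the measurement is simply the transferred size divided by the elapsed time. A minimal sketch of that calculation, using three (size, time) pairs taken from the figures above; the helper name is hypothetical.

```python
def throughput_mb_per_s(size_mb: float, seconds: float) -> float:
    """Average transfer rate for a file of size_mb moved in the given time."""
    return size_mb / seconds


# (size in MB, elapsed seconds) pairs quoted in the measurement
samples = [(200, 15), (2000, 141), (2000, 1111)]

for size_mb, seconds in samples:
    rate = throughput_mb_per_s(size_mb, seconds)
    print(f"{size_mb:5d} MB in {seconds:5d} s -> {rate:6.2f} MB/s")
```

Because the elapsed times were recorded in whole seconds, the 200 MB results carry a rounding uncertainty of several percent; the 2000 MB results are the more reliable ones.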
NFS performance measurement

Test with bonnie (bonnie -s 100)
– from a Linux node to the RAID on ccjsun over NFS

                                Sequential output (write)                 Sequential input (read)    Random seeks
                                per char      block         rewrite       per char      block
Node (kernel)              MB   K/sec  %CPU   K/sec  %CPU   K/sec  %CPU   K/sec  %CPU   M/sec  %CPU   /sec     %CPU
ap14 (2.2.10)             100     374   4.6     483   1.3     523   1.5    9712 100.0   254.0  99.2    571.5    6.1
ap15 (2.2.10, NFSv3)      100    6559  89.2    6554  16.3    6791  19.1    9650 100.0   252.6 101.1   1262.31
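
To make the difference between the two nodes concrete, the sketch below computes the ratio of the sequential write rates (K/sec) reported by bonnie for ap15 (NFSv3) and ap14. The dictionary keys follow bonnie's usual column order; assigning the numbers to those columns is a reading of the slide, not something stated on it explicitly.

```python
# Sequential output rates in K/sec, as read from the bonnie results above.
ap14 = {"per-char write": 374, "block write": 483, "rewrite": 523}
ap15_nfsv3 = {"per-char write": 6559, "block write": 6554, "rewrite": 6791}

# Ratio ap15 / ap14 for each write-type test.
speedup = {test: ap15_nfsv3[test] / ap14[test] for test in ap14}

for test, factor in speedup.items():
    print(f"{test:15s}: {factor:5.1f}x faster on ap15 (NFSv3)")
```

The read columns are nearly identical on both nodes, so the gain is confined to writes.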
→ Use NFSv3!
HPSS (High Performance Storage System)








Hierarchical storage system
HPSS server (SP2, 5 nodes, 20 CPUs, with SP switch and Gigabit NIC)
144 GB HPSS cache disk (SSA RAID5) + 288 GB work disk (RAID5)
HPSS 4.1.1, AIX 4.3.2
STK robot (4 Redwood drives, 100 TB tape media)
Alteon Ace 180 Gigabit Switch (jumbo frame operation)
Gigabit (jumbo frame) network and HIPPI connection to SUN/Linux
ftp or 'pftp' (parallel ftp) is used for data access between HPSS and SUN/Linux nodes.
Overview of HPSS-CCJ
[Diagram: network layout of the HPSS system at CC-J. Elements labeled in the figure: HPSS monitor, SUN (ACSLS), STK, SP nodes, SP Switch, SP Switch Router, CWS, three Gigabit switches, CPU farms, SUN E450 (for program development), Internet. Links labeled in the figure: Ethernet for HPSS control (100BaseT), internal Ethernet (10Base2), Ethernet for PVR (10BaseT), 1000Base-SX x 5, 1000Base-SX, 1000Base-LX, HIPPI, 100BaseT x 8, 10BaseT, Ethernet with global addresses (100BaseT).]
HPSS Hardware
[Diagram: HPSS hardware. Components labeled in the figure:]
Wide Node (HPSS SVR): 604e (332 MHz) 4-way, 1 GB memory, 4.5 GB × 2 HDD
Wide Node (Disk Mover #1): 604e (332 MHz) 4-way, 512 MB memory, 4.5 GB × 1 HDD
Wide Node (Disk Mover #2): 604e (332 MHz) 4-way, 512 MB memory, 4.5 GB × 1 HDD
Wide Node (Tape Mover #1): 604e (332 MHz) 4-way, 512 MB memory, 4.5 GB × 1 HDD
Wide Node (Tape Mover #2): 604e (332 MHz) 4-way, 512 MB memory, 4.5 GB × 1 HDD
7024-E30 Control Workstation: 256 MB memory, 4.5 GB × 2 HDD
7043-240 Workstation for HPSS Monitor: 256 MB memory, 9.1 GB × 2 HDD
9077-04S SP Switch Router: 10/100Base-T, 8 ports; HIPPI, 1 port
7133-020 Disk Subsystem: 9.1 GB × 16, 9.1 GB × 8, 9.1 GB × 8
288 GB disk units (three labels in the figure)
7015-R00 System Rack
STK Redwood
HPSS Software Configuration
[Diagram: software installed on each HPSS machine]
7024-E30 Control Workstation: AIX 4.3.2, PSSP 3.1, C for AIX 4.4
7043-240 Workstation for HPSS Monitor: AIX 4.3.2, DCE 2.2, Encina 4.2, HPSS 4.1.1, C for AIX 4.3, ssh 1.2.26, tcpwrapper 7.6
9076-550 POWER Parallel Server (five nodes, each with): AIX 4.3.2, PSSP 3.1, DCE 2.2, Encina 4.2, HPSS 4.1.1, C for AIX 4.4, ssh 1.2.26, tcpwrapper 7.6
– two of the nodes additionally: Advantape 41.1.7.2
– one of the nodes additionally: Sammi 4.1.2
STK Tape Robot




Redwood drives: ~11 MB/s/drive. Currently we have 4 drives, so about 45 MB/s in total can be achieved.
50 GB/cartridge × 2000 cartridges = 100 TB
Data (raw data and DSTs) will be transported on tape cartridges between RIKEN and BNL.
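
The tape figures above are simple products; the sketch below reproduces them and adds one derived number, the time to stream a full cartridge on a single drive. It assumes decimal units (1 GB = 1000 MB, 1 TB = 1000 GB) and sustained transfer at the quoted per-drive rate; the variable names are illustrative.

```python
DRIVES = 4
MB_PER_S_PER_DRIVE = 11      # "~11 MB/s/drive"
GB_PER_CARTRIDGE = 50
CARTRIDGES = 2000

# Aggregate rate with all drives streaming in parallel.
aggregate_mb_per_s = DRIVES * MB_PER_S_PER_DRIVE

# Total capacity of the robot, assuming 1 TB = 1000 GB.
capacity_tb = GB_PER_CARTRIDGE * CARTRIDGES / 1000

# Time to read or write one full cartridge on one drive, assuming 1 GB = 1000 MB.
hours_per_cartridge = GB_PER_CARTRIDGE * 1000 / MB_PER_S_PER_DRIVE / 3600

print(f"aggregate rate : {aggregate_mb_per_s} MB/s")
print(f"total capacity : {capacity_tb:.0f} TB")
print(f"one cartridge  : {hours_per_cartridge:.1f} h on a single drive")
```

With the rounded per-drive rate the product is 44 MB/s, consistent with the "about 45 MB/s" quoted on the slide.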
Network

LAN
– Gigabit Ethernet with jumbo frames (9 kB frames instead of the normal 1.5 kB; available on AIX 4.2 or later) and HIPPI
– Gigabit Ethernet performs similarly to HIPPI
– Gigabit Ethernet will be used.

WAN
– HEPNET-J/SINET between Japanese institutions
– APAN between RIKEN and ESnet sites (BNL etc.)

[Diagram: the same network layout as on the slide "Overview of HPSS-CCJ"]
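
One way to see why jumbo frames matter is to count how many frames per second a host must handle to fill a gigabit link. The sketch below is a rough estimate only: it treats the frame sizes quoted above as pure payload and ignores Ethernet headers, preamble and inter-frame gaps, and all names are illustrative.

```python
LINK_BITS_PER_S = 1_000_000_000   # nominal Gigabit Ethernet line rate


def frames_per_second(payload_bytes: int) -> float:
    """Frames needed per second to fill the link, ignoring framing overhead."""
    return LINK_BITS_PER_S / 8 / payload_bytes


standard = frames_per_second(1500)   # normal Ethernet frame
jumbo = frames_per_second(9000)      # jumbo frame

print(f"1.5 kB frames: {standard:,.0f} frames/s")
print(f"9 kB frames  : {jumbo:,.0f} frames/s")
print(f"reduction    : {standard / jumbo:.0f}x fewer frames to process")
```

Fewer frames per second means fewer interrupts and less per-packet protocol work on the hosts for the same data rate.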
Network Performance

Test with netperf http://www.netperf.org/netperf/NetperfPage.html
– More study is needed to reach near-gigabit performance.

Via Gigabit Ethernet (jumbo frame)

Direction               Recv. socket size   Send socket size   MB/s   CPU usage (sys)   CPU usage (idle)
Core server → CCJSUN    262640              262144             46.8   22.7              76.6
CCJSUN → core server    262144              262144             52.5   23.7              75.1

Direction               Recv. socket size   Send socket size   MB/s   CPU usage (sys)   CPU usage (idle)
Core server → CCJSUN    262640              262144             42.5   27.9              69.8
CCJSUN → core server    262144              262144             36.9   20                77.7
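
To judge how far these results are from the nominal line rate, the measured MB/s values can be expressed as a fraction of 1 Gbit/s. The sketch below assumes 1 MB = 10^6 bytes; if netperf was reporting units of 2^20 bytes the fractions would be about 5% higher. The function name is hypothetical.

```python
GIGABIT_BITS_PER_S = 1_000_000_000


def link_fraction(mb_per_s: float, bytes_per_mb: int = 1_000_000) -> float:
    """Fraction of a 1 Gbit/s link used by a transfer at mb_per_s."""
    return mb_per_s * bytes_per_mb * 8 / GIGABIT_BITS_PER_S


# The four throughput values measured with netperf, in MB/s.
measured_mb_per_s = [46.8, 52.5, 42.5, 36.9]

for rate in measured_mb_per_s:
    print(f"{rate:5.1f} MB/s -> {link_fraction(rate):.0%} of 1 Gbit/s")
```

All four measurements stay below half of the nominal rate, which is the gap the "more study needed" remark refers to.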
Data Transfer Performance

Test results with pftp (parallel ftp) between Linux nodes and HPSS
– Does the 100BaseT NIC on the Linux nodes limit the performance?
WAN http://ccjsun.riken.go.jp/cgi-bin/ping_data_plot.pl




[Two ping plots; the time axis has one tick per day.]
– Remote host ns.bnl.gov, packet size 100, from Fri Aug 20 0:19:10 1999 to Sun Aug 29 23:49:10 1999 (Japan time)
– Remote host cnsuty.cns.s.u-tokyo.ac.jp, packet size 100, from Fri Aug 20 0:19:09 1999 to Sun Aug 29 23:49:09 1999 (Japan time)
Key Software

PBS: Batch Queuing System
– http://pbs.mrj.com/
– Free package developed mainly at NAS of NASA

AFS: File system with Kerberos
– Important files (source code, libraries etc.) are on AFS at BNL.
– Mirrored from BNL

Monitoring: MRTG
– CPU, memory, and disk usage of each node, as well as the transmission rate over the network
– http://www.ceres.dti.ne.jp/~riocat/webtools/mrtg/
– http://ccjsun.riken.go.jp/~yokkaich/mrtg/resourceWatch/index.html
PHENIX Software
Summary




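All the "parts" that make up CC-J are now in place.
We are checking various aspects of the performance of each part and of the system as a whole.
Performance is largely as intended. (The requirements will be met once the planned quantities are installed.)
We will continue to track down bugs in the details and to improve our understanding of the performance, so as to meet the initial requirements.

               1999 requirement   Current          1999 planned             2001 requirement
CPU            2400 SPECint95     592 SPECint95    > 1184-1776 SPECint95    10700 SPECint95
Disk storage   5 TB               0.3 TB           > 1.9 TB                 15 TB
Disk I/O       200 MB/s           ~15 MB/s         ~50 MB/s?                600 MB/s
Tape storage   100 TB             100 TB           100 TB                   100 TB
Tape I/O       67.5 MB/s          45 MB/s          67.5 MB/s?               112.5 MB/s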
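
The gap between the current installation and the 1999 requirement can be read off by dividing the two columns. A minimal sketch using the figures from the table; the approximate disk I/O value (~15 MB/s) is taken as 15, and the dictionary names are illustrative.

```python
requirement_1999 = {
    "CPU (SPECint95)": 2400,
    "Disk storage (TB)": 5,
    "Disk I/O (MB/s)": 200,
    "Tape storage (TB)": 100,
    "Tape I/O (MB/s)": 67.5,
}
current = {
    "CPU (SPECint95)": 592,
    "Disk storage (TB)": 0.3,
    "Disk I/O (MB/s)": 15,      # quoted as "~15 MB/s"
    "Tape storage (TB)": 100,
    "Tape I/O (MB/s)": 45,
}

# Fraction of the 1999 requirement that is available now.
fraction = {item: current[item] / requirement_1999[item] for item in requirement_1999}

for item, value in fraction.items():
    print(f"{item:18s}: {value:6.1%} of the 1999 requirement")
```

Tape storage is already at the required level; the other items depend on the purchases planned for 1999.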